Abstract —Recommender systems, medical diagnosis, network
security, etc., require on-going learning and decision-making in 
real time. These - and many others - represent perfect examples 
of the opportunities and difficulties presented by Big Data: the 
available information often arrives from a variety of sources and 
has diverse features so that learning from all the sources may be 
valuable but integrating what is learned is subject to the curse of 
dimensionality. This paper develops and analyzes algorithms that 
allow efficient learning and decision-making while avoiding the 
curse of dimensionality. We formalize the information available 
to the learner/decision-maker at a particular time as a context 
vector which the learner should consider when taking actions. 
In general the context vector is very high dimensional, but in 
many settings, the most relevant information is embedded into 
only a few relevant dimensions. If these relevant dimensions were 
known in advance, the problem would be simple - but they 
are not. Moreover, the relevant dimensions may be different for 
different actions. Our algorithm learns the relevant dimensions 
for each action, and makes decisions based on what it has learned.
Formally, we build on the structure of a contextual multi-armed 
bandit by adding and exploiting a relevance relation. We prove a 
general regret bound for our algorithm whose time order depends 
only on the maximum number of relevant dimensions among 
all the actions, which in the special case where the relevance 
relation is single-valued (a function), reduces to $\tilde{O}(T^{2(\sqrt{2}-1)})$; in
the absence of a relevance relation, the best known contextual
bandit algorithms achieve regret $\tilde{O}(T^{(D+1)/(D+2)})$, where $D$ is
the full dimension of the context vector. Our algorithm alternates 
between exploring and exploiting and does not require observing 
outcomes during exploitation (so allows for active learning). 
Moreover, during exploitation, suboptimal actions are chosen 
with arbitrarily low probability. Our algorithm is tested on 
datasets arising from breast cancer diagnosis, network security 
and online news article recommendations. 

Index Terms —Contextual bandits, regret, dimensionality reduction,
learning relevance, recommender systems, online learning,
active learning.

I. Introduction 

The world is increasingly information-driven. Vast amounts 
of data are being produced by diverse sources and in diverse 
formats including sensor readings, physiological measurements,
documents, emails, transactions, tweets, and audio or
video files, and many businesses and government institutions
rely on these Big Data in their everyday operations. (Particular
applications that have been discussed in the literature include
recommender systems Q, neuroscience [4], network monitoring
[5], surveillance [6], health monitoring [7], stock market
prediction, intelligent driver assistance [8], etc.) To make the
best use of these data, it is vital to learn from and respond 
to the streams of data continuously and in real time. Because 
data streams are heterogeneous and dynamically evolving over 
time in unknown and unpredictable ways, making decisions 
using these data streams online, at run-time, is known to be 
a very challenging problem a coi. In this paper, we tackle 
these online Big Data challenges by exploiting a feature that 
is common to many applications: the data may have many 
dimensions, but the information that is most important for any 
given action is embedded into only a few relevant dimensions. 
In general, these relevant dimensions will be different for 
different actions and are not known in advance - so must 
be learned. We propose and analyze an algorithm that learns 
the relevant dimensions for each action, and makes decisions 
based on what it has learned.

Our structure builds on contextual multi-armed bandits. 
We formalize the information obtained from the data streams 
(perhaps after pre-processing) in terms of “context vectors”. 
Context vectors characterize the information contained in the 
data generated by the process the learner wishes to control/act 
on such as the location, and/or data type information (e.g., 
features/characteristics/modality). The decision maker/learner 
receives the context vector and takes an action that generates 
a reward that depends (stochastically) on the context vector. 
Contexts, actions and rewards are generic terms; the specific 
meaning depends on the specific Big Data application. For 
instance, in a network security application 0, contexts are 
the features of the network packet, actions are the set of 
predictions about the type of network attacks and the reward is 
the accuracy of the prediction. In a recommender system 0, 
contexts are the characteristics (age, gender, purchase history, 
etc.) of the user, actions are items and the reward is the 
indicator function of the event that the user buys the item. 
The problem is to learn the rewards (or the distribution of
rewards) generated by each action in each context. The context 
vector is typically high dimensional but in many applications 
the reward for a particular action will depend only on a few 
most relevant of these dimensions, embodied in a relevance
relation. For an action set $\mathcal{A}$ and a type (dimension) set $\mathcal{D}$,
the relevance relation is given by $\mathcal{R} = \{\mathcal{R}(a)\}_{a \in \mathcal{A}}$, where
$\mathcal{R}(a) \subset \mathcal{D}$. However, whether this is the case and, if so, which
dimensions are most relevant for a particular action is not
known in advance but must be learned, and decision-making
must be adapted to this learning process.

Relevance relations arise naturally in many practical applications.
For example, when treating patients with a particular
disease, many contexts may be available - the patients’ age, 
weight, blood tests, imaging, medical history etc. - but often 
only a few of these contexts are relevant in choosing/not 
choosing a particular treatment or medication. For instance, 
surgery may be strongly contra-indicated in patients with 
clotting problems; drug therapies that require close monitoring 
may be strongly contra-indicated in patients who do not 
have committed care-givers, etc. Similarly, in recommender 
systems, a product recommendation may sometimes depend on 
many characteristics of the user - gender, occupation, history 
of past purchases etc. - but will often depend only (or most 
strongly) on a few characteristics - such as location and home- 
ownership. 

Relevance allows us to avoid the curse of dimensionality:
we show that regret bounds depend only on the number of
relevant dimensions, i.e., $D_{\text{rel}}$, which is typically much less
than the full number of dimensions. Our main contributions
can be summarized as follows:

• We propose the Relevance Learning with Feedback (RELEAF)
algorithm that alternates between exploration and
exploitation phases. For the general case when $D_{\text{rel}} < D/2$,
RELEAF achieves a regret bound of $\tilde{O}(T^{g(D_{\text{rel}})})$,
where $g(D_{\text{rel}}) < (2D_{\text{rel}} + 3)/(2D_{\text{rel}} + 4)$, which reduces
to a regret bound of $\tilde{O}(T^{2(\sqrt{2}-1)})$ when the relevance
relation is a function.

• We derive separate bounds on the regret incurred in 
exploration and exploitation phases. RELEAF only needs 
to observe the reward in exploration phases and hence, 
when observing rewards is costly, active learning can 
be performed by controlling reward feedback. RELEAF 
achieves the same time order of regret even when observing
rewards is costly.

• The operation of RELEAF involves a confidence parameter,
chosen by the user, which can be arbitrarily
small. If confidence $\delta$ is chosen, then, with probability
at least $1 - \delta$, RELEAF will never select suboptimal
actions in exploitation steps. This provides performance
guarantees, which are important - perhaps vital - in many
applications, such as medical treatment.

The rest of the paper is organized as follows. Related work
is given in Section II. The problem is formalized in Section
III. An algorithm that learns the relevance relation between
actions and types of contexts is given in Section IV. Then, the
regret bounds are proved for this algorithm. Numerical results
on several real-world datasets are given in Section V. Finally,
conclusions are given in Section VI.

II. Related Work 
A. Multi-armed bandits 

Our work is a new contextual bandit problem where relevance
relations exist. Contextual bandit problems have been studied
by many others in the past HD-CS). The problem we consider
in this paper is a special case of the Lipschitz contextual bandit
problem [13], [14], where the only assumption is the existence
of a known similarity metric between the expected rewards
of actions for different contexts. The strength of this model
comes from the fact that there are no stochastic assumptions
made on the context arrival process, and the benchmark which
the regret is defined against selects the best action for each
context. It is known that the lower bound on regret for this
problem is $O(T^{(D+1)/(D+2)})$ [13], and there exist algorithms
that achieve $\tilde{O}(T^{(D+1)/(D+2)})$ regret [13]. Compared
to these works, RELEAF only needs to observe rewards in
explorations and has a regret whose time order is independent
of $D$. Hence it can still learn the optimal actions fast enough
in settings where observations are costly and the context
vector is high dimensional. For instance, in Section IV-D we
show that the regret of RELEAF is better than the bound of
$\tilde{O}(T^{(D+1)/(D+2)})$ in [14] for $D_{\text{rel}} < D/2 - 1$.

1 $O(\cdot)$ is the Big O notation; $\tilde{O}(\cdot)$ is the same as $O(\cdot)$ except that it hides terms
that have polylogarithmic growth.

Another class of contextual bandit problems considers reward
functions that are linear in the contexts a, Ha. Due to 
this linearity assumption learning reduces to estimating the 
parameter vector corresponding to each arm, hence the regret 
bounds do not depend on the dimension of the context space. 
Several papers El, ESI impose stochastic assumptions on 
the process that generates the contexts and the arm rewards. 
For instance assuming that the contexts and arm rewards are 
generated by an unknown i.i.d. process, regret independent of 
the dimension of the context space can be achieved. 

The differences between our work and these prior works are
summarized in Table I.
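
The comparison with the exponent $(D+1)/(D+2)$ made earlier in this subsection can be checked numerically. The Python sketch below evaluates the upper bound $(2D_{\text{rel}}+3)/(2D_{\text{rel}}+4)$ on the regret exponent of RELEAF against the exponent $(D+1)/(D+2)$ of the generic bound. The function names are ours, introduced only for illustration; the sketch compares exponents only and ignores constants and logarithmic factors.

```python
from fractions import Fraction


def releaf_exponent_bound(d_rel: int) -> Fraction:
    """Upper bound (2*D_rel + 3) / (2*D_rel + 4) on the RELEAF regret exponent."""
    return Fraction(2 * d_rel + 3, 2 * d_rel + 4)


def generic_exponent(d: int) -> Fraction:
    """Exponent (D + 1) / (D + 2) of the generic contextual bandit bound."""
    return Fraction(d + 1, d + 2)


def releaf_bound_is_smaller(d_rel: int, d: int) -> bool:
    """True when the RELEAF exponent bound is strictly below the generic one."""
    return releaf_exponent_bound(d_rel) < generic_exponent(d)


if __name__ == "__main__":
    d = 20
    for d_rel in (1, 2, 5, 8, 9, 10):
        print(d_rel, releaf_exponent_bound(d_rel), generic_exponent(d),
              releaf_bound_is_smaller(d_rel, d))
```

Since $1 - 1/(2D_{\text{rel}}+4) < 1 - 1/(D+2)$ holds exactly when $2D_{\text{rel}} + 4 < D + 2$, the check returns true precisely for $D_{\text{rel}} < D/2 - 1$, which is the condition stated above.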

B. Dimensionality reduction 

Dimensionality reduction methods are often used to find 
low dimensional representations of high dimensional context 
vectors (feature vectors) such that the information contained in 
the low dimensional representation is approximately equal to 
the information contained in the original context vector C3- 
For instance, reduced-rank adaptive filtering m-m first 
projects feature vectors onto a lower dimensional subspace, 
and then adaptively adjusts the filter coefficients over time. In 
these works a low dimensional representation of the feature 
vector is learned based on the available data. Compared to 
this, in our work the relevant dimensions for each action 
can be different, hence a low dimensional representation that 
contains information about the rewards of all actions may not 
exist. An example is a relevance relation $\mathcal{R}$ for which each
action only has few relevant dimensions, i.e., $D_{\text{rel}} \ll D$, but
$\bigcup_{a \in \mathcal{A}} \mathcal{R}(a) = \mathcal{D}$.
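
To make this concrete, here is a small hypothetical instance written in Python (the action names and type indices are invented for illustration). Every action depends on a single type, so the relevance dimension is 1, yet the union of the relevant types covers all four dimensions, so no single low dimensional projection serves every action.

```python
# Hypothetical relevance relation: action -> set of relevant type indices.
D = 4
types = set(range(1, D + 1))
relevance = {
    "a1": {1},
    "a2": {2},
    "a3": {3},
    "a4": {4},
}

# Relevance dimension: the largest number of relevant types over all actions.
d_rel = max(len(r) for r in relevance.values())

# Union of the relevant types over all actions.
covered = set().union(*relevance.values())

print(d_rel)             # 1
print(covered == types)  # True
```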

2 The bounds in [13], [14] are given in terms of covering and zooming dimensions
of the problem instance, but they reduce to the Euclidean dimension
for the set of assumptions we have in this paper.

C. Learning with limited number of observations

Examples of related works that consider limited observations
while learning are KWIK learning [21], [22] and
label efficient learning [23]-[25]. For example, [22] considers
a bandit model where the reward function comes from a
parameterized family of functions and gives a bound on the
average regret. An online prediction problem is considered in
[23]-[25], where the predictor (action) lies in a class of linear
predictors. The benchmark of the context is the best linear
predictor. This restriction plays a crucial role in deriving regret
bounds whose time order does not depend on $D$. Similar to
these works, RELEAF can guarantee with a high probability
that actions with suboptimality greater than a desired $\epsilon > 0$
will never be selected in exploitation steps. However, we do 
not have any assumptions on the form of the expected reward 
function other than the Lipschitz continuity. 

For the special case when actions correspond to making 
predictions about the context vector (which is equal to the data 
stream for this special case), our problem is closely related to 
the problem of active learning. In this problem, obtaining the 
labels is costly, but the performance of the learning algorithm, 
i.e., rewards, can only be assessed through the labels, hence 
actively learning when to ask for the label becomes an 
important challenge. In stream-based active learning [26]-[29],
the learner is provided with a stream of unlabeled instances. 
When an instance arrives, the learner decides to obtain the 
label or not. To the best of our knowledge there is no prior 
work in stream-based active learning that deals with learning 
relevance relations with sublinear bounds on the regret. 

D. Ensemble learning 

Numerous ensemble learning methods exist in the literature
0 , m-m. These methods take predictions (actions) from 
a set of experts (e.g., base classifiers), and combine them with 
a specific rule to produce a final prediction (action). After the
rewards of all the actions are observed, the rule to combine
the predictions of the experts is updated based on how well
each individual expert has performed. The goal is to learn
a combination rule such that even if the predictions of the
individual experts are not very accurate, the final prediction
is accurate because it takes into account the “opinions” of all 
experts. 

To evaluate the performance of ensemble learning methods 
analytically, the benchmark is usually taken to be the expert 
that achieves the highest total reward. Hence the “quality” of 
the regret bounds depends on the “quality” of the experts. 
In contrast, our regret bounds are with respect to the best 
benchmark (that only depends on context arrivals and reward 
distributions), and can be applied to settings without experts. 
Moreover, our algorithms work for the bandit setting, in which 
after an action is chosen, only its reward is revealed to the 
algorithm. 

III. Problem Formulation and Preliminaries 
A. Notation 

For a vector $x$, $x_i$ denotes its $i$th component. Given a
vector $v$, $x_v := \{x_i\}_{i \in v}$ denotes the components of $x$ whose
positions are in $v$. The time index is $t = 1, 2, \ldots$. When
referring to a time dependent variable we use subscript $t$ as the
rightmost subscript corresponding to that variable. For instance,
$x_t$ denotes a vector at time $t$, $x_{i,t}$ denotes its $i$th component
at time $t$, and $x_{v,t}$ denotes the vector of its components that
are in $v$ at time $t$.

B. Problem formulation 

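$\mathcal{A}$ is the set of actions, $D$ is the dimension of the context
vector, $\mathcal{D} := \{1, 2, \ldots, D\}$ is the set of types, and
$\mathcal{R} = \{\mathcal{R}(a)\}_{a \in \mathcal{A}} : \mathcal{A} \to 2^{\mathcal{D}}$ is the (unknown) relevance
relation, which maps every $a \in \mathcal{A}$ to a subset of $\mathcal{D}$. We
call $D_{\text{rel}} = \max_{a \in \mathcal{A}} |\mathcal{R}(a)|$ the relevance dimension. When
$D_{\text{rel}} = 1$, we say that $\mathcal{R}$ is a relevance function. Elements of
$\mathcal{D}$ are denoted by index $i$. Let $\mathcal{V}_K$, $1 \le K \le D$, be the set of
$K$-element subsets of $\mathcal{D}$. We call $v \in \mathcal{V}_K$ a $K$-tuple of types.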

At each time step $t = 1, 2, \ldots$, a context vector $x_t$ arrives
to the learner. After observing $x_t$ the learner selects an action
$a \in \mathcal{A}$, which results in a random reward $r_t(a, x_t)$. The learner
may choose to observe this reward by paying cost $c_O > 0$. The
goal of the learner is to maximize the sum of the generated
rewards minus costs of observations for any time horizon $T$.

Each $x_t$ consists of $D$ types of contexts, and can be written
as $x_t = (x_{1,t}, x_{2,t}, \ldots, x_{D,t})$, where $x_{i,t}$ is called the type $i$
context. $\mathcal{X}_i$ denotes the space of type $i$ contexts and $\mathcal{X} :=
\mathcal{X}_1 \times \mathcal{X}_2 \times \ldots \times \mathcal{X}_D$ denotes the space of context vectors. At
any $t$, we have $x_{i,t} \in \mathcal{X}_i$ for all $i \in \mathcal{D}$. All of our results hold
for the case when $\mathcal{X}_i$ is a bounded subset of the real line. The
number of elements in $\mathcal{X}_i$ can be finite or infinite. For the
sake of notational simplicity we take $\mathcal{X}_i = [0,1]$ for all $i \in \mathcal{D}$,
since the values of context can be rescaled to lie in this range.
Then, for the case when the actual context space is finite, $[0,1]$
will be a superset of the context space. For a context vector $x$,
$x_{\mathcal{R}(a)}$ denotes the vector of values of $x$ corresponding to types
$\mathcal{R}(a)$. The reward of action $a$ for $x = (x_1, x_2, \ldots, x_D) \in \mathcal{X}$,
i.e., $r_t(a, x)$, is generated according to an i.i.d. process with
distribution $F(a, x_{\mathcal{R}(a)})$ with support in $[0,1]$ and expected
value $\mu(a, x_{\mathcal{R}(a)})$. The learner does not know $F(a, x_{\mathcal{R}(a)})$
and $\mu(a, x_{\mathcal{R}(a)})$ for $a \in \mathcal{A}$, $x \in \mathcal{X}$ a priori.

The following assumption gives a similarity structure between
the expected reward of an action and the contexts of
the type that is relevant to that action.

Assumption. (The Similarity Assumption) For all $a \in \mathcal{A}$,
$x, x' \in \mathcal{X}$, we have $|\mu(a, x_{\mathcal{R}(a)}) - \mu(a, x'_{\mathcal{R}(a)})| \le L \|x_{\mathcal{R}(a)} - x'_{\mathcal{R}(a)}\|$,
where $L > 0$ is the Lipschitz constant and $\|\cdot\|$ is the
Euclidean norm.

We assume that the learner knows the L given in the 
Similarity Assumption. While we need this assumption in 
order to derive our analytic bounds on the performance 
of the algorithm, as is common in all contextual bandit
algorithms HD, m, our numerical results in Section V
show that the proposed algorithm works well on real-world 
data sets for which this assumption may not hold. Given a
context vector $x = (x_1, x_2, \ldots, x_D)$, the optimal action is
$a^*(x) := \arg\max_{a \in \mathcal{A}} \mu(a, x_{\mathcal{R}(a)})$. In order to assess the
learner's loss due to unknowns, we compare its performance
with the performance of an oracle benchmark which knows
$a^*(x)$ for all $x \in \mathcal{X}$. Let $\mu_t(a) := \mu(a, x_{\mathcal{R}(a),t})$. The action
chosen by the learner at time $t$ is denoted by $a_t$. The learner
also decides whether to observe the reward or not, and this
decision of the learner at time $t$ is denoted by $\beta_t \in \{0,1\}$.
If $\beta_t = 1$, then the learner chooses to observe the reward;
if $\beta_t = 0$, then the learner does not observe the reward.
The learner's performance loss with respect to the oracle
benchmark is defined as the regret, whose value at time $T$
is given by

$$R(T) := \sum_{t=1}^{T} \mu_t(a^*(x_t)) - \sum_{t=1}^{T} \left( \mu_t(a_t) - c_O \beta_t \right). \tag{1}$$

Different from the definitions of regret in related works ED- 
m, there is an additional cost $c_O$, which is called the active
learning/exploration cost. Hence the goal of the learner is to 
maximize its total reward while balancing the active learning 
costs incurred when observing the rewards. The algorithm 
we propose in this paper is able to achieve a given tradeoff 
between the two by actively controlling when to observe the 
rewards. 
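
The regret in (1) can be evaluated for a simulated run as in the following Python sketch. It assumes that the expected rewards are available as sequences, which is only possible in simulation; the function and argument names are ours and are not part of the formulation above.

```python
from typing import Sequence


def regret(mu_opt: Sequence[float],
           mu_chosen: Sequence[float],
           observed: Sequence[int],
           cost: float) -> float:
    """Regret R(T) of (1): oracle reward minus (learner reward - observation costs).

    mu_opt[t]    -- expected reward of the optimal action for the context at step t
    mu_chosen[t] -- expected reward of the action the learner actually chose
    observed[t]  -- 1 if the learner paid to observe the reward, else 0
    cost         -- observation (active learning) cost per observed reward
    """
    if not (len(mu_opt) == len(mu_chosen) == len(observed)):
        raise ValueError("all sequences must have the same length T")
    oracle_total = sum(mu_opt)
    learner_total = sum(m - cost * b for m, b in zip(mu_chosen, observed))
    return oracle_total - learner_total
```

For example, `regret([0.9, 0.8, 0.9], [0.5, 0.8, 0.9], [1, 0, 0], cost=0.1)` is approximately 0.5: a loss of 0.4 from one suboptimal action plus 0.1 paid for one observation.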

A regret that grows sublinearly in $T$, i.e., $O(T^{\gamma})$, $\gamma < 1$,
guarantees convergence in terms of the average reward, i.e.,
$R(T)/T \to 0$. We are interested in achieving sublinear growth
with a rate depending only on $D_{\text{rel}}$, independent of $D$.

IV. Online Learning of Relevance Relations 
A. Relevance Learning with Feedback 

In this section we propose the algorithm Relevance LEArning
with Feedback (RELEAF), which learns the best action for
each context vector by simultaneously learning the relevance
relation, and then estimating the expected reward of each
action based on the values of the contexts of the relevant types.
The feedback, i.e., reward observations, is controlled based on
the past context vector arrivals, in a way that the reward
observations are only made for actions for which the uncertainty
in the reward estimates is high for the current context vector.
The controlled feedback feature allows RELEAF to operate
as an active learning algorithm. RELEAF has a relevance
parameter $\gamma_{\text{rel}}$, which is the number of relevant types it will
learn for each action. In order to have analytic bounds on the
regret, it is required that $\gamma_{\text{rel}} \ge D_{\text{rel}}$. However, the numerical
results in Section V show that even with $\gamma_{\text{rel}} = 1$, RELEAF
performs very well on several real-world datasets. We assume
that RELEAF knows $D_{\text{rel}}$ but not $\mathcal{R}$. Hence, in this paper we
assume that RELEAF is run with $\gamma_{\text{rel}} = D_{\text{rel}}$. In theory, it is
enough for RELEAF to know an upper bound $\bar{D}_{\text{rel}}$ on $D_{\text{rel}}$.
Then, the regret of RELEAF will depend on $\bar{D}_{\text{rel}}$. Operation
of RELEAF can be summarized as follows:

• Adaptively form partitions (composed of intervals) of the
context space of each type in $\mathcal{D}$ and use them to learn the
action rewards of similar context vectors together from
the history of observations.

• For an action, form reward estimates for $2\gamma_{\text{rel}}$-tuples of
intervals corresponding to $2\gamma_{\text{rel}}$-tuples of types. Based on
the accuracy of these estimates, either choose to explore
and observe the reward (by paying cost $c_O$ for active
learning) or choose to exploit the best estimated action
(but do not observe the reward) for the current context
vector.

• In order to estimate the expected rewards of the actions
accurately, find the set of $\gamma_{\text{rel}}$-tuples of types relevant to
each action $a$. For instance, a $\gamma_{\text{rel}}$-tuple of types $v \in \mathcal{V}_{\gamma_{\text{rel}}}$
is relevant to action $a$ if $\mathcal{R}(a) \subset v$. Conclude that $v$ is
relevant to $a$ if the variation of the reward estimates does
not greatly exceed the natural variation of the expected
reward of action $a$ over the hypercube corresponding to
$v$ formed by intervals of type $i \in v$ (calculated using the
Similarity Assumption).


Relevance Learning with Feedback (RELEAF):

1: Input: $L$, $\rho$, $\delta$, $\gamma_{\text{rel}}$.
2: Initialization: $\mathcal{P}_{i,1} = \{[0,1]\}$, $i \in \mathcal{D}$. Run Initialize($i$, $\mathcal{P}_{i,1}$, 1), $i \in \mathcal{D}$.
3: while $t \ge 1$ do
4:   Observe $x_t$, find $p_t$ that $x_t$ belongs to.
5:   Set $\mathcal{U}_t := \bigcup_{i \in \mathcal{D}} \mathcal{U}_{i,t}$, where $\mathcal{U}_{i,t}$ (given in (5)) is the set of under-explored actions for type $i$.
6:   if $\mathcal{U}_t \ne \emptyset$ then
7:     (Explore) $\beta_t = 1$, select $a_t$ randomly from $\mathcal{U}_t$, observe $r_t(a_t, x_t)$.
8:     Update sample mean reward of $a_t$ corresponding to $2\gamma_{\text{rel}}$-tuples of intervals: for all $q \in \mathcal{Q}_t$, given in (2),
       $\bar{r}^{v(q)}(q, a_t) = (S^{v(q)}(q, a_t) \bar{r}^{v(q)}(q, a_t) + r_t(a_t, x_t)) / (S^{v(q)}(q, a_t) + 1)$.
9:     Update counters: for all $q \in \mathcal{Q}_t$, $S^{v(q)}(q, a_t)$++.
10:  else
11:    (Exploit) $\beta_t = 0$, for each $a \in \mathcal{A}$ calculate the set of candidate relevant contexts $\text{Rel}_t(a)$ given in (6).
12:    for $a \in \mathcal{A}$ do
13:      if $\text{Rel}_t(a) = \emptyset$ then
14:        Randomly select $\hat{c}_t(a)$ from $\mathcal{V}_{\gamma_{\text{rel}}}$.
15:      else
16:        For each $v \in \text{Rel}_t(a)$, calculate $\text{Var}_t(v, a)$ given in (7).
17:        Set $\hat{c}_t(a) = \arg\min_{v \in \text{Rel}_t(a)} \text{Var}_t(v, a)$.
18:      end if
19:      Calculate $\bar{r}_t^{\hat{c}_t(a)}(p_{\hat{c}_t(a),t}, a)$ as given in (8).
20:    end for
21:    Select $a_t = \arg\max_{a \in \mathcal{A}} \bar{r}_t^{\hat{c}_t(a)}(p_{\hat{c}_t(a),t}, a)$.
22:  end if
23:  for $i \in \mathcal{D}$ do
24:    $N^i(p_{i,t})$++.
25:    if $N^i(p_{i,t}) > 2^{\rho l(p_{i,t})}$ then
26:      Create two new level $l(p_{i,t}) + 1$ intervals $p$, $p'$ whose union gives $p_{i,t}$.
27:      $\mathcal{P}_{i,t+1} = \mathcal{P}_{i,t} \cup \{p, p'\} - \{p_{i,t}\}$.
28:      Run Initialize($i$, $\{p, p'\}$, $t$).
29:    else
30:      $\mathcal{P}_{i,t+1} = \mathcal{P}_{i,t}$.
31:    end if
32:  end for
33:  $t = t + 1$
34: end while

Initialize($i$, $\mathcal{B}$, $t$):

1: for $p \in \mathcal{B}$ do
2:   Set $N^i(p) = 0$, $\bar{r}^{(v(q),i)}((q,p), a) = 0$, $S^{(v(q),i)}((q,p), a) = 0$ for all $2\gamma_{\text{rel}}$-tuples of types $(v(q), i)$ that contain type $i$, for all $a \in \mathcal{A}$ such that $(q, p) \in \mathcal{P}_{(v(q),i),t}$.
3: end for

Fig. 1. Pseudocode for RELEAF.

In order to learn fast, RELEAF exploits the similarities
between the context vectors of the relevant type³ given in the
Similarity Assumption to estimate the rewards of the actions.
The key to success of our algorithm is that this estimation is
good enough if relevant tuples of types for each action are
correctly identified. Since in Big Data applications $D$ can be
very large, learning the $D_{\text{rel}}$-tuple of types that is relevant to
each action greatly increases the learning speed.

RELEAF adaptively forms the partition of the space for
each type in $\mathcal{D}$, where the partition for the context space of
type $i$ at time $t$ is denoted by $\mathcal{P}_{i,t}$. All the elements of $\mathcal{P}_{i,t}$
are disjoint intervals of $\mathcal{X}_i$ whose lengths are elements of the
set $\{1, 2^{-1}, 2^{-2}, \ldots\}$.⁴ An interval with length $2^{-l}$, $l \ge 0$, is
called a level $l$ interval, and for an interval $p$, $l(p)$ denotes
its level and $s(p)$ denotes its length. By convention, intervals are
of the form $(a, b]$, with the only exception being the interval
containing 0, which is of the form $[0, b]$.⁵ Let $p_{i,t} \in \mathcal{P}_{i,t}$ be
the interval that $x_{i,t}$ belongs to, $p_t := (p_{1,t}, \ldots, p_{D,t})$ and
$\mathcal{P}_t := (\mathcal{P}_{1,t}, \ldots, \mathcal{P}_{D,t})$. For $v \in \mathcal{V}_K$, $1 \le K \le D$, let $p_{v,t}$
denote the elements of $p_t$ corresponding to types in $v$, and let
$\mathcal{P}_{v,t} := \times_{i \in v} \mathcal{P}_{i,t}$.

The pseudocode of RELEAF is given in Fig. 1. RELEAF
starts with $\mathcal{P}_{i,1} = \{\mathcal{X}_i\} = \{[0,1]\}$ for each $i \in \mathcal{D}$. As time
goes on and more contexts arrive for each type $i$, it divides
$\mathcal{X}_i$ into smaller and smaller intervals. Then, these intervals
are used to create $2\gamma_{\text{rel}}$-dimensional hypercubes corresponding
to $2\gamma_{\text{rel}}$-tuples of types, and past observations corresponding
to context vectors lying in these hypercubes are used to
form sample mean reward estimates of the expected action
rewards. The intervals are created in a way to balance the
variation of the sample mean rewards due to the number
of past observations that are used to calculate them and the
variation of the expected rewards in each hypercube formed
by the intervals. For each interval $p \in \mathcal{P}_{i,t}$, RELEAF keeps a
counter for the number of type $i$ context arrivals to $p$. When the
value of this counter exceeds $2^{\rho l(p)}$, where $\rho > 0$ is an input
of RELEAF called the duration parameter, $p$ is destroyed
and two level $l(p) + 1$ intervals, whose union gives $p$, are
created. For example, when $p_{i,t} = (k2^{-l}, (k+1)2^{-l}]$ for some
$0 < k < 2^l - 1$, if $N^i(p_{i,t}) > 2^{\rho l}$, RELEAF sets

$$\mathcal{P}_{i,t+1} = \left( \mathcal{P}_{i,t} - \{p_{i,t}\} \right) \cup \left\{ (k2^{-l}, (k+1/2)2^{-l}], \ ((k+1/2)2^{-l}, (k+1)2^{-l}] \right\}.$$

Otherwise $\mathcal{P}_{i,t+1}$ remains the same as $\mathcal{P}_{i,t}$. It is easy to see
that the lifetime of an interval increases exponentially in its
duration parameter.
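
The splitting rule can be sketched in Python as follows. This is a minimal illustration for a single type, not the authors' implementation: the class name `TypePartition` is ours, intervals are stored as `(level, k)` pairs standing for the interval from $k2^{-l}$ to $(k+1)2^{-l}$, and we read "exceeds" as strictly greater than, which is an assumption.

```python
class TypePartition:
    """Adaptive dyadic partition of [0, 1] for one context type."""

    def __init__(self, rho: float):
        self.rho = rho                # duration parameter (rho > 0)
        self.counts = {(0, 0): 0}     # (level, k) -> number of arrivals

    def locate(self, x: float) -> tuple:
        """Return the active interval (level, k) that contains x in [0, 1]."""
        for level, k in self.counts:
            width = 2.0 ** -level
            lo, hi = k * width, (k + 1) * width
            # Intervals are (lo, hi], except that the leftmost one includes 0.
            if lo < x <= hi or (k == 0 and x == 0.0):
                return (level, k)
        raise ValueError("x must lie in [0, 1]")

    def update(self, x: float) -> tuple:
        """Record an arrival at x and split its interval if the counter is too large."""
        level, k = self.locate(x)
        self.counts[(level, k)] += 1
        # Assumption: split once the counter is strictly above 2^(rho * level).
        if self.counts[(level, k)] > 2.0 ** (self.rho * level):
            del self.counts[(level, k)]
            # The two children start with fresh counters.
            self.counts[(level + 1, 2 * k)] = 0
            self.counts[(level + 1, 2 * k + 1)] = 0
        return (level, k)
```

With `rho = 2.0`, the level 0 interval is split after its second arrival, while a level 1 interval survives until its fifth, which illustrates how deeper intervals live longer.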

We next describe the control numbers RELEAF keeps for
each type $i$, and the counters and sample mean rewards RELEAF
keeps for $2\gamma_{\text{rel}}$-tuples of intervals ($2\gamma_{\text{rel}}$-dimensional hypercubes)
corresponding to a $2\gamma_{\text{rel}}$-tuple of types, which it uses to determine
whether to explore or exploit and how to exploit. Let $\mathcal{V}_K(i)$
be the set of $K$-tuples of types that contains type $i$. For each
$v \in \mathcal{V}_K(i)$, we have $i \in v$.

Let $\mathcal{D}_{-v} := \mathcal{D} - \{v\}$. For type $i$, let $\mathcal{Q}_{i,t} := \{p_{v,t} : v \in
\mathcal{V}_{2\gamma_{\text{rel}}}(i)\}$ be the set of $2\gamma_{\text{rel}}$-tuples of intervals that includes
an interval belonging to type $i$ at time $t$, and let

$$\mathcal{Q}_t := \bigcup_{i \in \mathcal{D}} \mathcal{Q}_{i,t}. \tag{2}$$

3 RELEAF only needs to know $L$ but not $\mathcal{R}$. Even if $L$ is not known, it can
use a slowly increasing function $L(t)$ as an estimate for $L$ so that a sublinear
regret bound will hold for a time horizon $T$ such that $L(T) > L$.

4 Setting interval lengths to powers of 2 is for presentational simplicity. In
general, interval lengths can be set to powers of any integer greater than 1.

5 Endpoints of intervals will not matter in our analysis, so our results will
hold even when the intervals have common endpoints.


To denote an element of $\mathcal{Q}_{i,t}$ or $\mathcal{Q}_t$ we use index $q$.
For any $q \in \mathcal{Q}_t$, the tuple of types corresponding to the
tuple of intervals in $q$ is denoted by $v(q)$. For instance, if
$q = (q_{i_1}, q_{i_2}, \ldots, q_{i_{2\gamma_{\text{rel}}}})$, then $v(q) = (i_1, i_2, \ldots, i_{2\gamma_{\text{rel}}})$. The
decision to explore or exploit at time $t$ is solely based on $p_t$.
For events $A_1, \ldots, A_K$, let $I(A_1, \ldots, A_K)$ denote the indicator
function of event $\bigcap_{k=1:K} A_k$. Let

$$S_t^{v(q)}(q, a) := \sum_{t'=1}^{t} I\left( a_{t'} = a, \ \beta_{t'} = 1, \ p_{v(q),t'} = q \right)$$

be the number of times $a$ is selected and the reward is observed
when the context values corresponding to types $v(q)$ are in $q$
and $q \in \mathcal{P}_{v(q),t}$. Also let

$$\bar{r}_t^{v(q)}(q, a) := \frac{\sum_{t'=1}^{t} r_{t'}(a, x_{t'}) \, I\left( a_{t'} = a, \ \beta_{t'} = 1, \ p_{v(q),t'} = q \right)}{S_t^{v(q)}(q, a)}$$

be the sample mean reward of action $a$ for the $2\gamma_{\text{rel}}$-tuple of
intervals $q$.

At time $t$, RELEAF assigns a control number to each $i \in \mathcal{D}$
denoted by

$$D_{i,t} := \frac{2 \log(t D^* |\mathcal{A}| / \delta)}{(L s(p_{i,t}))^2}, \tag{3}$$

where

$$D^* := \binom{D - 1}{2\gamma_{\text{rel}} - 1}. \tag{4}$$

This number depends on the cardinality of $\mathcal{A}$, the length of
the active interval that type $i$ context is in at time $t$, and a
confidence parameter $\delta > 0$, which controls the accuracy of
sample mean reward estimates. $D_{i,t}$ is a sufficient number of
reward observations from an action, which guarantees that the
estimated reward for that action will be sufficiently close to
the expected reward for the context at time $t$. By sufficiently
close we mean that when $i$ is the relevant type of context for
the action, the difference between the true expected reward of
that action and the estimated expected reward will be less than
a constant factor of the length of the interval that contains
the type $i$ context due to the Similarity Assumption. The
control function ensures that within each hypercube, the rate
of exploration only increases logarithmically in time. It also
guarantees that each action is explored on the order of $1/s(p_{i,t})^2$
times at least, which guarantees that the regret due to exploitations in
each hypercube is small enough to achieve a sublinear regret
bound (see Theorem 1).
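
For concreteness, the control number in (3) and the constant in (4) can be computed as in the Python sketch below. The function and argument names are ours, and the natural logarithm is assumed since the text does not state the base.

```python
import math


def d_star(num_types: int, gamma_rel: int) -> int:
    """D* of (4): the binomial coefficient C(D - 1, 2 * gamma_rel - 1)."""
    return math.comb(num_types - 1, 2 * gamma_rel - 1)


def control_number(t: int, num_types: int, num_actions: int, gamma_rel: int,
                   delta: float, lipschitz: float, interval_length: float) -> float:
    """D_{i,t} of (3) for an active interval of the given length at time t."""
    log_term = math.log(t * d_star(num_types, gamma_rel) * num_actions / delta)
    return 2.0 * log_term / (lipschitz * interval_length) ** 2
```

Two properties are visible directly from this form: halving the interval length quadruples the required number of observations, while the dependence on $t$ is only logarithmic.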

Then, it computes the set of under-explored actions for type
$i$ as

$$\mathcal{U}_{i,t} := \left\{ a \in \mathcal{A} : S_t^{v(q)}(q, a) < D_{i,t} \text{ for some } q \in \mathcal{Q}_{i,t} \right\}, \tag{5}$$

and then, the set of under-explored actions as $\mathcal{U}_t := \bigcup_{i \in \mathcal{D}} \mathcal{U}_{i,t}$.
The decision to explore or exploit is based on whether or not
$\mathcal{U}_t$ is empty, as follows:

(i) If $\mathcal{U}_t \ne \emptyset$, RELEAF randomly selects an action $a_t \in \mathcal{U}_t$
to explore, and observes its reward $r_t(a_t, x_t)$. Reward
observation costs $c_O$, which is the active learning cost. Then, it
updates the sample mean rewards and counters for all $q \in \mathcal{Q}_t$,

$$\bar{r}_{t+1}^{v(q)}(q, a_t) = \frac{S_t^{v(q)}(q, a_t) \, \bar{r}_t^{v(q)}(q, a_t) + r_t(a_t, x_t)}{S_t^{v(q)}(q, a_t) + 1},$$

$$S_{t+1}^{v(q)}(q, a_t) = S_t^{v(q)}(q, a_t) + 1.$$

(ii) If $\mathcal{U}_t = \emptyset$, RELEAF exploits by estimating the relevant
$\gamma_{\text{rel}}$-tuple of types $\hat{c}_t(a)$ for each $a \in \mathcal{A}$ and forming sample
mean reward estimates for action $a$ based on $\hat{c}_t(a)$. It first
computes the set of candidate relevant tuples of types for each
$a \in \mathcal{A}$. For each $v \in \mathcal{V}_{\gamma_{\text{rel}}}$, let $\mathcal{V}_{2\gamma_{\text{rel}}}(v)$ be the set of $2\gamma_{\text{rel}}$-tuples
of types such that $v \cap w = v$ for $w \in \mathcal{V}_{2\gamma_{\text{rel}}}(v)$.


B. Why sample mean reward estimates for 2~f re /-tuple of 
inten’als are required? 

Assume that RELEAF knows D le \, hence 7 re i = D le \. Then, 
RELEAF computes sample mean reward estimates for 2D xe \- 
tuples of intervals corresponding to different types and uses 
them to learn the action with the highest reward by learning the 
relevant /J r d-tuples of types. However, is it possible to learn 
the action with the highest reward by only forming sample 
mean estimates for /A rc |-tuples of intervals? For instance con¬ 
sider the case when D le \ = 1 and the following greedy learning 
algorithm called Greedy-RELEAF, outlined as follows: 

(i) Form sample mean reward estimates of each action a for 
each type i £ T>, i.e., r\(p, a), p £ Vi,t based only on the con¬ 
text arrivals corresponding to type i\ (ii) In exploitation steps 
choose the action with the highest sample mean reward over all 
sets of intervals in p t , i.e., argmax a6 _4 maxi 6 x> r\{pi(t), a). 
The following lemma shows that there exists a context arrival 
process for which the regret of Greedy-RELEAF will be linear 
in time. 


The set of candidate relevant tuples of types for action $a$ is

$$\mathrm{Rel}_t(a) := \Big\{ v \in \mathcal{V}^{\gamma_{rel}} : |\bar{r}^{w}_t(p_{w,t}, a) - \bar{r}^{w'}_t(p_{w',t}, a)| \le 3L\sqrt{\gamma_{rel}} \max_{i \in v} s(p_{i,t}),\ \forall w, w' \in \mathcal{V}^{2\gamma_{rel}}(v) \Big\}. \quad (6)$$

The intuition is that if the tuple of types $v$ contains the tuple
of types $\mathcal{R}(a)$ that is relevant to $a$, then independent of the
values of the contexts of the other types, the variation of the
pairwise sample mean reward of $a$ over $p_{w,t}$ must be very
close to the variation of the expected reward of $a$ in $p_{v,t}$ for
$w \in \mathcal{V}^{2\gamma_{rel}}(v)$ in exploitation steps.

If $\mathrm{Rel}_t(a)$ is empty, this implies that RELEAF failed to
identify the relevant tuple of types, hence $\hat{c}_t(a)$ is randomly
selected from $\mathcal{V}^{\gamma_{rel}}$. If $\mathrm{Rel}_t(a)$ is nonempty, RELEAF computes
the maximum variation

$$\mathrm{Var}_t(v, a) := \max_{w, w' \in \mathcal{V}^{2\gamma_{rel}}(v)} |\bar{r}^{w}_t(p_{w,t}, a) - \bar{r}^{w'}_t(p_{w',t}, a)|, \quad (7)$$

for each $v \in \mathrm{Rel}_t(a)$. Then it sets $\hat{c}_t(a) = \arg\min_{v \in \mathrm{Rel}_t(a)} \mathrm{Var}_t(v, a)$.
This way, whenever $\mathcal{R}(a) \subset v$
for some $v \in \mathrm{Rel}_t(a)$, even if $v$ is not selected as the
estimated relevant tuple of types, the sample mean reward of
$a$ calculated based on the estimated relevant tuple of types
will be very close to the sample mean of its reward calculated
according to $\mathcal{R}(a)$. After finding the estimated relevant tuple
of types $\hat{c}_t(a)$ for $a \in \mathcal{A}$, the sample mean rewards of the
actions are computed as

$$\bar{r}^{\hat{c}_t(a)}_t(p_{\hat{c}_t(a),t}, a) := \frac{\sum_{w \in \mathcal{V}^{2\gamma_{rel}}(\hat{c}_t(a))} \bar{r}^{w}_t(p_{w,t}, a)\, S^{w}_t(p_{w,t}, a)}{\sum_{w \in \mathcal{V}^{2\gamma_{rel}}(\hat{c}_t(a))} S^{w}_t(p_{w,t}, a)}.$$

Then, RELEAF selects

$$a_t = \arg\max_{a \in \mathcal{A}} \bar{r}^{\hat{c}_t(a)}_t(p_{\hat{c}_t(a),t}, a).$$

Different from explorations, since the reward is not observed
in exploitations, sample mean rewards and counters are not
updated.
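The exploration update and the exploitation rule can be sketched in a few lines for the special case $\gamma_{rel} = 1$. This is only an illustrative sketch, not the authors' implementation: the function names, the dictionary layout keyed by type pairs, and the choice to return `None` when the candidate set is empty (where RELEAF would pick a type at random) are assumptions made for the example.

```python
from itertools import combinations


def update(mean, count, reward):
    """Incremental sample-mean and counter update used in exploration steps."""
    return (count * mean + reward) / (count + 1), count + 1


def estimate_relevant_type(pair_mean, pair_count, interval_len, L):
    """Return (c_hat, reward_estimate) for one action when gamma_rel = 1.

    pair_mean[(i, j)]  : sample mean for the active interval pair of types i < j
    pair_count[(i, j)] : number of samples behind that mean (assumed positive)
    interval_len[i]    : length s(p_{i,t}) of the active interval of type i
    """
    variation = {}
    for i in sorted(interval_len):
        pairs = [p for p in pair_mean if i in p]
        # Maximum variation over all pairs that contain type i, as in (7).
        var_i = max(
            (abs(pair_mean[w] - pair_mean[w2]) for w, w2 in combinations(pairs, 2)),
            default=0.0,
        )
        # Membership test for the candidate set, as in (6) with gamma_rel = 1.
        if var_i <= 3 * L * interval_len[i]:
            variation[i] = var_i
    if not variation:
        return None, None  # RELEAF would select a type at random here
    c_hat = min(variation, key=variation.get)
    pairs = [p for p in pair_mean if c_hat in p]
    total = sum(pair_count[p] for p in pairs)
    # Counter-weighted average of the pairwise sample means.
    estimate = sum(pair_mean[p] * pair_count[p] for p in pairs) / total
    return c_hat, estimate
```

In an exploitation step, one would call `estimate_relevant_type` once per action and select the action with the largest returned estimate; no call to `update` is made because the reward is not observed.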


Lemma 1. Let $\mathcal{A} = \{a, b\}$, $\mathcal{D} = \{i, j\}$, $\mathcal{R}(a) = i$, $\mathcal{R}(b) = j$.
Let $x_i(t) = x$ for all $t$, and $x_j(t) = 1$ with probability 0.8 and
$x_j(t) = 0$ with probability 0.2 for all $t$ independently. Assume
that $\mu(a, x) = 0.5$ and $\mu(b, x_j(t)) = x_j(t)$. Then, we have
$R(T) = O(T)$.

Proof: Given that Greedy-RELEAF explores sufficiently
many times, at an exploitation step $t$ when the context vector
is $(x, 0)$, we have

$$P\big( |\bar{r}^i_t(p_i(t), a) - 0.5| < 0.1,\ |\bar{r}^i_t(p_i(t), b) - 0.8| < 0.1,\ |\bar{r}^j_t(p_j(t), a) - 0.5| < 0.1 \big) \ge 0.5$$

for any $p_i(t)$ containing $x$ and $p_j(t)$ containing $0$. At such a $t$,
Greedy-RELEAF will select action $b$ with probability at least
0.5, resulting in an expected regret of at least $0.5^2$. Assume
that the context vector arrivals are such that $(x, 0)$ appears
more than 50% of the time for all $T$ large enough. Then, the
regret of Greedy-RELEAF will be linear in $T$. ■
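The failure mode in Lemma 1 can be reproduced with a small deterministic calculation. The sketch below replaces the sample means by their large-sample limits (an assumption made for illustration; the lemma itself is a probabilistic statement), and all function and constant names are hypothetical.

```python
# Values taken from the problem instance in Lemma 1.
P_XJ_ONE = 0.8   # probability that the type-j context equals 1
MU_A = 0.5       # expected reward of action a at the fixed type-i context x


def mu(action, xj):
    """Expected reward of an action given the type-j context xj."""
    return MU_A if action == "a" else float(xj)


def greedy_limit_estimates(xj):
    """Large-sample limits of the single-type estimates Greedy-RELEAF compares."""
    est = {}
    for action in ("a", "b"):
        # Type i: the context is always x, so the estimate averages over xj.
        by_type_i = P_XJ_ONE * mu(action, 1) + (1 - P_XJ_ONE) * mu(action, 0)
        # Type j: the estimate conditions on the observed value of xj.
        by_type_j = mu(action, xj)
        est[action] = max(by_type_i, by_type_j)
    return est


def greedy_choice(xj):
    est = greedy_limit_estimates(xj)
    return max(est, key=est.get)


def regret(xj):
    best = max(mu("a", xj), mu("b", xj))
    return best - mu(greedy_choice(xj), xj)
```

At context $(x, 0)$ the type-$i$ estimate of action $b$ tends to 0.8 even though its true expected reward there is 0, so the greedy rule prefers $b$ and loses 0.5 in expected reward.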

For the problem instance given in Lemma 1, RELEAF
will calculate and compare sample mean rewards
$\bar{r}^{(i,j)}_t((p_i(t), p_j(t)), a)$ for pairs of intervals corresponding to
different types instead of directly forming sample mean rewards
for intervals of each type; hence in exploitations it can
identify that the type relevant to action $a$ is $i$ and the type
relevant to action $b$ is $j$ with a very high probability. We will prove this in the
following subsection by deriving a sublinear in time regret
bound for RELEAF for the case when $D_{rel} = 1$. A general
regret bound for $1 < D_{rel} < D/2$ is proven in our online
technical report [33].

C. Regret analysis of RELEAF for $D_{rel} = 1$

In this section we derive analytical regret bounds for RELEAF.
For simplicity of exposition, we prove our bounds for
the special case when $D_{rel} = 1$, i.e., when the relevance
relation is a function, and RELEAF is run with $\gamma_{rel} = D_{rel}$.
Although $D_{rel} = 1$ is the simplest special case, our numerical
results on real-world datasets in Section V show that
RELEAF performs very well with $\gamma_{rel} = 1$.

Let $\tau(T) \subset \{1, 2, \ldots, T\}$ be the set of time steps in which
RELEAF exploits by time $T$. $\tau(T)$ is a random set which
depends on context arrivals and the randomness of the action
selection of RELEAF. The regret $R(T)$ defined in (1) can
be written as a sum of the regret incurred during explorations
(denoted by $R_O(T)$) and the regret incurred during
exploitations (denoted by $R_I(T)$). Computing the two regrets
separately gives more flexibility when choosing the parameters
of RELEAF according to the objective of the learner. Although
the definition of the regret in (1) allows us to write the regret
as $R_O(T) + R_I(T)$, the learner can set the parameters of
RELEAF according to other objectives such as minimizing
$R_I(T)$ subject to $R_O(T) \le K$ for a fixed $T$ and $K > 0$, or
minimizing the time order of the regret when it is a more
general function of the regrets in explorations and exploitations,
i.e., $f(R_O(T), R_I(T))$. For instance, in an online prediction
problem, if the cost of accessing the true label (exploration)
is small, but the cost of making a prediction error in an
exploitation step is very large, the learner can trade off to
have a higher rate of explorations.

The following theorem gives a bound on the regret of
RELEAF in exploitation steps.

Theorem 1. Let RELEAF run with relevance parameter $\gamma_{rel} = 1$,
duration parameter $\rho > 0$, confidence parameter $\delta > 0$ and
control numbers

$$D_{i,t} := \frac{2 \log(t|\mathcal{A}|D/\delta)}{(L s(p_{i,t}))^2},$$

for $i \in \mathcal{D}$. Let $R_{inst}(t)$ be the instantaneous regret at time $t$,
which is the loss in expected reward at time $t$ due to not
selecting $a^*(x_t)$. When the relevance relation is such that
$D_{rel} = 1$, then, with probability at least $1 - \delta$, we have

$$R_{inst}(t) \le 8L \left( s(p_{\mathcal{R}(a_t),t}) + s(p_{\mathcal{R}(a^*(x_t)),t}) \right),$$

for all $t \in \tau(T)$, and the total regret in exploitation steps is
bounded above by

$$R_I(T) \le 8L \sum_{t \in \tau(T)} \left( s(p_{\mathcal{R}(a_t),t}) + s(p_{\mathcal{R}(a^*(x_t)),t}) \right) \le 16 L D 2^{2\rho} T^{\rho/(1+\rho)},$$

for arbitrary context vectors $x_1, x_2, \ldots, x_T$. Hence
$R_I(T)/T = O(T^{-1/(1+\rho)})$, and $\lim_{T \to \infty} R_I(T)/T = 0$.

Proof: The proof is given in Appendix A. ■

Theorem 1 provides both context arrival process dependent
and worst case bounds on the exploitation regret of RELEAF.
By choosing $\rho$ arbitrarily close to zero, $R_I(T)$ can be made
$O(T^{\gamma})$ for any $\gamma > 0$. While this is true, the reduction in
regret for smaller $\rho$ not only comes from increased accuracy,
but is also due to the reduction in the number of time steps
in which RELEAF exploits, i.e., $|\tau(T)|$. By definition, time $t$
is an exploitation step if

$$S^{(i,j)}_t((p_{i,t}, p_{j,t}), a) \ge \frac{2 \log(t|\mathcal{A}|D/\delta)}{L^2 \min\{s(p_{i,t})^2, s(p_{j,t})^2\}} = \frac{2^{2\max\{l(p_{i,t}), l(p_{j,t})\}+1} \log(t|\mathcal{A}|D/\delta)}{L^2},$$

for all $q = (p_{i,t}, p_{j,t}) \in Q_t$, $i, j \in \mathcal{D}$. This implies that for any
$q \in Q_{i,t}$ which has the interval with maximum level equal to
$l$, $O(2^{2l})$ explorations are required before any exploitation can
take place. Since the time a level $l$ interval can stay active is
$2^{\rho l}$, it is required that $\rho > 2$ so that $\tau(T)$ is nonempty.
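The relation between interval levels, control numbers and active durations can be checked numerically. The sketch below assumes that a level-$l$ interval has length $s(p) = 2^{-l}$, which is what the equality between the two forms of the exploitation condition implies; the function names are hypothetical.

```python
import math


def interval_length(level):
    """Length s(p) of a level-l interval, assuming s(p) = 2^{-l}."""
    return 2.0 ** (-level)


def control_number(t, num_actions, num_types, delta, L, level):
    """D_{i,t} = 2 log(t |A| D / delta) / (L s(p_{i,t}))^2."""
    log_term = math.log(t * num_actions * num_types / delta)
    return 2.0 * log_term / (L * interval_length(level)) ** 2


def control_number_by_level(t, num_actions, num_types, delta, L, level):
    """Equivalent form 2^{2l+1} log(t |A| D / delta) / L^2."""
    log_term = math.log(t * num_actions * num_types / delta)
    return 2.0 ** (2 * level + 1) * log_term / L ** 2


def active_duration(level, rho):
    """Number of arrivals a level-l interval stays active: 2^{rho l}."""
    return 2.0 ** (rho * level)
```

Each additional level multiplies the required number of explorations by 4 and the active duration by $2^{\rho}$, which is the comparison behind the requirement on $\rho$ stated above.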

The next theorem gives a bound on the regret of RELEAF
in exploration steps.

Theorem 2. Let RELEAF run with $\gamma_{rel}$, $\rho$, $\delta$ and $D_{i,t}$, $i \in \mathcal{D}$
values as stated in Theorem 1. When the relevance relation is
such that $D_{rel} = 1$, we have

$$R_O(T) \le \frac{960 D^2 (c_O + 1) \log(T|\mathcal{A}|D/\delta)}{7L^2} T^{4/\rho} + \frac{64 D^2 (c_O + 1)}{3} T^{2/\rho},$$

with probability 1, for arbitrary context vectors
$x_1, x_2, \ldots, x_T$. Hence $R_O(T)/T = O(T^{(4-\rho)/\rho})$, and
$\lim_{T \to \infty} R_O(T)/T = 0$ for $\rho > 4$.

Proof: The proof is given in Appendix B. ■

Based on the choice of the duration parameter $\rho$, which
determines how long an interval will stay active, it is possible
to get different regret bounds for explorations and exploitations.
Any $\rho > 4$ will give a sublinear regret bound for
both explorations and exploitations. The regret in exploitations
increases in $\rho$ while the regret in explorations decreases in $\rho$.

Theorem 3. Let RELEAF run with $\gamma_{rel}$, $\delta$ and $D_{i,t}$, $i \in \mathcal{D}$
values as stated in Theorem 1 and $\rho = 2 + 2\sqrt{2}$. Then, the time
orders of exploration and exploitation regrets are balanced up
to logarithmic orders. With probability at least $1 - \delta$ we have
both $R_I(T) = \tilde{O}(T^{2/(1+\sqrt{2})})$ and $R_O(T) = \tilde{O}(T^{2/(1+\sqrt{2})})$.

Proof: The time order of the exploitation regret is increasing
in $\rho$ from the result of Theorem 1, and the time order of
the exploration regret is decreasing in $\rho$ from the result of
Theorem 2. The time orders of both regrets are balanced
when $\rho/(1 + \rho) = 4/\rho$, which gives the result. ■
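The balancing condition in the proof is the quadratic $\rho^2 - 4\rho - 4 = 0$, whose positive root is $\rho = 2 + 2\sqrt{2}$. A short numerical check of the two time-order exponents is sketched below; the function names are chosen for illustration only.

```python
import math


def exploitation_exponent(rho):
    """Time-order exponent of the exploitation regret bound, rho / (1 + rho)."""
    return rho / (1.0 + rho)


def exploration_exponent(rho):
    """Time-order exponent of the exploration regret bound, 4 / rho."""
    return 4.0 / rho


# Positive root of rho^2 - 4 rho - 4 = 0.
rho_star = 2.0 + 2.0 * math.sqrt(2.0)
balanced_exponent = exploitation_exponent(rho_star)
```

For $\rho$ below this value the exploration exponent dominates, and above it the exploitation exponent dominates, which matches the monotonicity argument in the proof.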

Another interesting case is when actions with suboptimality
greater than $\epsilon > 0$ must never be chosen in any exploitation
step by time $T$. When such a condition is imposed, RELEAF
can start with partitions $\mathcal{P}_{i,1}$ that have intervals with high
levels such that it explores more at the beginning to have
more accurate reward estimates before any exploitation. The
following theorem gives the regret bound of RELEAF for this
case.

Theorem 4. Let RELEAF run with relevance parameter $\gamma_{rel} = 1$,
duration parameter $\rho > 0$, confidence parameter $\delta > 0$,
control numbers

$$D_{i,t} := \frac{2 \log(t|\mathcal{A}|D/\delta)}{(L s(p_{i,t}))^2},$$

and with initial partitions $\mathcal{P}_{i,1}$, $i \in \mathcal{D}$, consisting of intervals
with levels $l_{min} = \lceil \log_2(3L/(2\epsilon)) \rceil$. When the relevance
relation is such that $D_{rel} = 1$, then, with probability $1 - \delta$, we
have

$$R_{inst}(t) \le \epsilon,$$

for all $t \in \tau(T)$,

$$R_I(T) \le 16 L D 2^{2\rho} T^{\rho/(1+\rho)},$$

and

$$R_O(T) \le \frac{81 L^4}{\epsilon^4} \left( \frac{960 D^2 (c_O + 1) \log(T|\mathcal{A}|D/\delta)}{7L^2} T^{4/\rho} + \frac{64 D^2 (c_O + 1)}{3} T^{2/\rho} \right),$$

for arbitrary context vectors $x_1, x_2, \ldots, x_T$. Bounds on $R_I(T)$
and $R_O(T)$ are balanced for $\rho = 2 + 2\sqrt{2}$.

Proof: The proof is given in Appendix C. ■
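The choice of the initial level can be illustrated numerically. The sketch below assumes interval length $s(p) = 2^{-l}$ for a level-$l$ interval and uses the accuracy threshold $(3L/2)\,s(p)$ from the proof in Appendix C; the function names and the clamp at level 0 are assumptions added for the example.

```python
import math


def min_level(L, eps):
    """Smallest nonnegative integer level l with (3L/2) * 2^{-l} <= eps."""
    return max(0, math.ceil(math.log2(3.0 * L / (2.0 * eps))))


def accuracy_at_level(L, level):
    """Deviation threshold (3L/2) * s(p) for a level-l interval, s(p) = 2^{-l}."""
    return 1.5 * L * 2.0 ** (-level)
```

For example, with $L = 1$ and $\epsilon = 0.1$ this gives level 4, since $1.5/16 \le 0.1$ while $1.5/8 > 0.1$. Because the ceiling adds less than one level, $2^{l_{min}} < 3L/\epsilon$, so the factor $2^{4 l_{min}}$ that multiplies the exploration bound is below $81L^4/\epsilon^4$.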


D. Regret bound for RELEAF for $D_{rel} < D/2$

Similar to the analysis in the previous subsection, RELEAF
achieves sublinear in time regret for any $D_{rel} < D/2$.

Theorem 5. Let RELEAF run with relevance parameter $\gamma_{rel} = D_{rel}$,
duration parameter $\rho > 0$, confidence parameter $\delta > 0$
and control numbers

$$D_{i,t} := \frac{2 \log(t|\mathcal{A}|D^*/\delta)}{(L s(p_{i,t}))^2},$$

for i £ T>, where D* is given in Q- Then, with probability 
at least $1 - \delta$ we have $R_I(T) = \tilde{O}(T^{g(D_{rel})})$ and $R_O(T) = \tilde{O}(T^{g(D_{rel})})$, where

$$g(D_{rel}) := \frac{2 + 2D_{rel} + \sqrt{4D_{rel}^2 + 16D_{rel} + 12}}{4 + 2D_{rel} + \sqrt{4D_{rel}^2 + 16D_{rel} + 12}}.$$

Proof: The proof is given in Appendix D. ■

The bound on the regret given in Theorem 5 matches the
bound in Theorem 3 for $D_{rel} = 1$.

Remark 1. The regret bound in Theorem 5 is better than the
generic regret bound $O(T^{(D+1)/(D+2)})$ for contextual bandit
algorithms ifTil/ , £31/ that does not exploit the existence of 
relevance relations when $D_{rel} < D/2 - 1$.
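The exponent $g(D_{rel})$ of Theorem 5 and the comparison in Remark 1 can be evaluated directly. The following sketch only evaluates the two closed-form exponents; the function names are chosen for illustration.

```python
import math


def g(d_rel):
    """Time-order exponent of the regret bound in Theorem 5."""
    root = math.sqrt(4 * d_rel ** 2 + 16 * d_rel + 12)
    return (2 + 2 * d_rel + root) / (4 + 2 * d_rel + root)


def generic_exponent(D):
    """Exponent (D + 1) / (D + 2) of the generic contextual bandit bound."""
    return (D + 1) / (D + 2)
```

For $D_{rel} = 1$ the square root equals $4\sqrt{2}$ and $g(1) = (4 + 4\sqrt{2})/(6 + 4\sqrt{2}) = 2/(1+\sqrt{2})$, the exponent of Theorem 3. Since $g(d) = 1 - 2/(4 + 2d + \sqrt{4d^2 + 16d + 12})$, the comparison with $1 - 1/(D+2)$ reduces to $D > d + \sqrt{d^2 + 4d + 3}$, which holds whenever $d < D/2 - 1$.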


V. Numerical Results 

In this section, we numerically compare the performance
of our learning algorithm with state-of-the-art learning techniques,
including ensemble learning methods and other multi-armed
bandit algorithms, on three different real-world datasets:
(i) breast cancer diagnosis, (ii) network intrusion detection,
(iii) webpage recommendation. The purpose of the simulations for
the first two datasets is to show that RELEAF can learn to
make accurate predictions without the need for base classifiers,
which are required by ensemble learners. The purpose of the
simulations for the third dataset is to show that RELEAF can
learn to make accurate recommendations based on the context
vectors of the users, by only observing the click information
for the recommended webpage.


A. Datasets 

Breast Cancer (BC) [34]: The dataset consists of features
extracted from images of fine needle aspirates (FNA) of
breast masses, which give information about the size, shape,
uniformity, etc., of the cells. Each feature has a finite number
of values that it can take, and the values of features are
normalized such that they lie in $[0,1]$. Each case is labeled
either as “malignant” or “benign”. We assume that images
arrive to the learner in an online fashion. At each time slot, the
learning algorithm operates on a 9-dimensional feature vector
which consists of a subset of the features extracted from the
same image.

The prediction action belongs to the set
{benign, malignant}. Reward is 1 when the prediction
is correct and 0 otherwise. 50000 instances are created by
duplication of the data and are randomly sequenced. Out of
these, 69% of the instances are labeled as “benign” while the
rest are labeled as “malignant”.

Network Intrusion (NI) [34]: The network intrusion dataset
from the UCI archive [34] consists of a series of TCP connection
records, labeled either as normal connections or as attacks.
The data consists of 42 features, and we take 15 of them as
types of contexts. The selected features are normalized to lie in $[0,1]$.
The prediction action belongs to the set {attack, noattack}.
Reward is 1 when the prediction is correct and 0 otherwise.

Webpage Recommendation (WR) Q: This dataset contains 
webpage recommendations of Yahoo! Front Page which is an 
Internet news website. Each instance of this dataset consists 
of (i) IDs of the recommended items and their features, (ii) 
context vector of the user, and (iii) user click information. For 
a recommended webpage (item), reward is 1 if the user clicks 
on the item and 0 otherwise. The context vector for each user 
is generated by mapping a higher dimensional set of features 
of the user including features such as gender, age, purchase 
history, etc., to $[0,1]^5$. The details of this mapping are given in
ED. We select 5 items and consider T = 10000 user arrivals. 
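The normalization of context values to $[0,1]$ used for the datasets above follows the usual min-max scheme (subtract the minimum, divide by the range). A minimal sketch is given below; it assumes that the minimum and maximum are taken separately for each context type (column), which is an interpretation rather than something the text states, and it maps a constant column to 0.0 as an arbitrary convention. The function name is hypothetical.

```python
def min_max_normalize(rows):
    """Scale each column (context type) of a list of rows to [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    normalized = []
    for row in rows:
        normalized.append([
            # Subtract the column minimum, then divide by the column range.
            (value - lo) / (hi - lo) if hi > lo else 0.0
            for value, lo, hi in zip(row, lows, highs)
        ])
    return normalized
```

For instance, the rows `[1, 10]`, `[3, 20]`, `[5, 40]` are mapped to `[0, 0]`, `[0.5, 1/3]`, `[1, 1]`.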

B. Learning algorithms 

Next we briefly summarize the algorithms considered in our 
evaluation: 

RELEAF: Our algorithm given in Fig. 1 with control
numbers $D_{i,t}$ divided by 5000 to reduce the number of
explorations (footnote 7).

RELEAF-ALL: Same as RELEAF except that the reward of
the selected action is observed in every time step. This version
is useful when the reward of the selected action can be
observed at no cost.

RELEAF-FO: Same as RELEAF except that it observes
the rewards of all actions instead of the reward of the selected
action. We refer to this version of our algorithm as RELEAF
with full observation (RELEAF-FO).

Normalization is done in the following way: the maximum and minimum
context values in the dataset are found. The minimum context value is subtracted
from all contexts, then the result is divided by the difference between the
maximum and minimum values.

7 The theoretical bounds are proven to hold for worst-case context vector
arrivals and reward distributions. In practice, the relevance relation and the
order of action rewards are identified correctly with far fewer explorations.

Contextual zooming (CZ) [14]: This algorithm adaptively
creates balls over the joint action and context space, calculates
an index for each ball based on the history of selections of
that ball, and at each time step selects an action according to
the ball with the highest index that contains the action-context
pair.

Hybrid-$\epsilon$ [35]: This algorithm is the contextual version of
$\epsilon$-greedy, which forms context-dependent sample mean rewards
for the actions by considering the history of observations and 
decisions for groups of contexts that are similar to each other. 

LinUCB [31]: This algorithm computes an index for each
action by assuming that the expected reward of an action is a 
linear combination of different types of contexts. The action 
with the highest index is selected at each time step. 

Ensemble Learning Methods Average Majority (AM) 13, 
Adaboost 130), Online Adaboost ED and Blum’s Variant of 
Weighted Majority (Blum) |[32l : The goal of ensemble learning 
is to create a strong (high accuracy) classifier by combining 
predictions of base classifiers. Hence all these methods require 
base classifiers (trained a priori) that produce predictions (or 
actions) based on the context vector. 

AM simply follows the prediction of the majority of the 
classifiers and does not perform active learning. Adaboost is 
trained a priori with 1500 instances, whose labels are used to 
compute the weight vector. Its weight vector is fixed during the 
test phase (it is not learning online); hence no active learning is 
performed during the test phase. In contrast, Online Adaboost
always receives the true label at the end of each time slot. It 
uses a time window of 1000 past observations to retrain its 
weight vector. Similar to Online Adaboost, Blum also learns 
its weight vector online. The key differences between our 
algorithm and the methods that we compare against are given 
in Table II.
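As a point of reference for the simplest of these baselines, AM's rule of following the majority of the base classifiers can be sketched as follows. The function name and the tie-breaking rule (the label that appears first among the tied ones wins) are assumptions; the text does not specify how ties are resolved.

```python
from collections import Counter


def average_majority(predictions):
    """Return the label predicted by most base classifiers.

    `predictions` is a non-empty list with one predicted label per base
    classifier. Ties are broken in favor of the tied label that appears
    first in the list (an assumption made for this sketch).
    """
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label
```

With six base classifiers, as in the breast cancer simulations, a 3-3 split is possible, which is why some tie rule is needed in practice.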

C. Breast cancer simulations 

In this section we compare the performance of RELEAF,
RELEAF-ALL and RELEAF-FO with the other learning methods
described in Section V-B. For the ensemble learning methods,
there are 6 logistic regression base classifiers, each trained
with a different set of 10 instances.

The simulation results are given in Table III. Since
RELEAF-FO updates the reward of both predictions after the
label is received, it achieves a lower error rate than
RELEAF. In this setting it is natural to assume that the rewards
of both predictions are updated, because observing the label
gives information about which prediction is correct. RELEAF-ALL,
which observes all the labels, has the lowest error rate.

Among the ensemble learning schemes, Adaboost and Online
Adaboost perform the best; however, their error rates are
more than two times higher than the error rate of RELEAF and
about three times higher than the error rate of RELEAF-FO.
Although the number of actively obtained labels (explorations)
for RELEAF and RELEAF-FO is higher than the number of initial
training samples used to train Adaboost, neither RELEAF
nor RELEAF-FO has a predetermined exploration size, unlike
Adaboost. This is especially beneficial when the time horizon of
interest is unknown or prediction performance is desired to be
uniformly good over all time instances. CZ is the best among
the other multi-armed bandit algorithms with 3.15% error, but
worse than RELEAF, which has 1.88% error.

D. Network intrusion simulations 

In this section we compare the performance of RELEAF,
RELEAF-ALL and RELEAF-FO with the other learning methods
described in Section V-B. For the ensemble learning methods,
the base classifiers are logistic regression classifiers, each
trained with 5000 different instances from the NI dataset. A comparison
of performances in terms of the error rate is given in Table IV.
We see that RELEAF-FO has an error rate of 0.68%,
more than two times lower than that of any of the ensemble learning
methods. All the ensemble learning methods we compare
against use classifiers to make predictions, and these classifiers
require a priori training. In contrast, RELEAF and RELEAF-FO
do not require any a priori training, learn online and
require only a small number of label observations (i.e., they
can perform active learning).

CZ performs very poorly in this simulation because its
learning rate is sensitive to the Lipschitz constant that is given as
an input to the algorithm, which we set equal to 0.5. Numerical
results related to the performance of CZ and RELEAF for
different $L$ values can be found in our online technical report
[33]. LinUCB performs the best in terms of the overall error
rate, but the error rate of RELEAF in
exploitations is lower than that of LinUCB. This highlights the
finding of Theorem 1 regarding RELEAF, which states that
highly suboptimal actions are not chosen in exploitations with
a high probability.


| Algorithm | error % | exploitation error % | number of label observations |
|---|---|---|---|
| AM | 3.07 | N/A | 0 |
| Adaboost | 3.1 | N/A | 1500 |
| Online Adaboost | 2.25 | N/A | all |
| Blum | 1.64 | N/A | all |
| CZ | 53 | N/A | all |
| Hybrid-$\epsilon$ | 8.8 | N/A | all |
| LinUCB | 0.27 | N/A | all |
| RELEAF | 1.19 | U7I4 | 398 |
| RELEAF-ALL | 1.07 | TH | all |
| RELEAF-FO | 0.68 | XHA | 229 |

TABLE IV
Comparison of the error rates of RELEAF-FO with ensemble
learning methods for network intrusion dataset.


E. Webpage recommendation simulations 

In this dataset only the click behavior of the user for the
recommended item is observed. Moreover, it is reasonable to
assume that the click behavior feedback is always available (no
costly observations). The ensemble learning methods require
the availability of experts recommending actions and full reward
feedback, including the rewards of the actions that are not
selected, to update the weights of the experts; hence they are
not suitable for this dataset. In contrast, multi-armed bandit
methods are more suitable since only the feedback about
the reward of the chosen action is required. Hence we only
compare RELEAF-ALL, CZ, LinUCB and Hybrid-$\epsilon$ for this
dataset. We compare the click through rates (CTRs), i.e., the
average number of times the recommended item is clicked,
of all algorithms in Table V. We observe that RELEAF-ALL
has the highest CTR.

| Algorithm | Base classifiers | Prior training | Online learning | Active learning |
|---|---|---|---|---|
| AM [5] | required | no | no | no |
| Adaboost [30] | required | required | no | no |
| Online Adaboost [?], Blum [?] | required | required | yes | no |
| CZ [14], Hybrid-$\epsilon$ [35], LinUCB [31] | not required | not required | yes | no |
| RELEAF | not required | not required | yes | yes |

TABLE II
Properties of RELEAF, ensemble learning methods and other contextual bandit algorithms.

| Algorithm | Performance: error % | Performance: missed % | Performance: false % | number of label observations | active learning cost for $c_O = 1$ |
|---|---|---|---|---|---|
| AM | 8.22 | 17.20 | 4.09 | 0 (no online learning) | 0 |
| Adaboost | 4.60 | 3^7 | 4.97 | 1500 (to train weights) | 1500 |
| Online Adaboost | 4.68 | T07 | 4.95 | all labels are observed | 50000 |
| Blum | 11.18 | 27.12 | 3.86 | all labels are observed | 50000 |
| CZ | 3.15 | T74 | 2.89 | all labels are observed | 50000 |
| Hybrid-$\epsilon$ | 8.83 | 11.77 | 7.48 | all labels are observed | 50000 |
| LinUCB | 10.67 | TT1 | 12.22 | all labels are observed | 50000 |
| RELEAF | 1.88 | 1.93 | 1.86 | 7630 | 2630 |
| RELEAF-ALL | 1.24 | 1.19 | 1.36 | all labels are observed | 50000 |
| RELEAF-FO | 1.68 | 1.34 | 1.82 | 7630 | 2630 |

TABLE III
Comparison of RELEAF with ensemble learning methods and other contextual bandit algorithms for the breast cancer dataset.

F. Identifying the relevant types 

When RELEAF exploits at time $t$, it identifies a relevant
type $\hat{c}_t(a)$ for every action $a \in \mathcal{A}$ and selects the arm with
the highest sample mean reward according to its estimated
relevant type. Hence, the value of the context of the relevant
type plays an important role in how well RELEAF performs.

For each dataset we choose a single action and, for each
chosen action, show in Table VI the percentage of times a type
is selected as the type that is relevant to that action in the time
slots in which RELEAF exploits. Since there are many types, only
the 4 types that are selected as the relevant type for
the corresponding action the highest number of times are shown.
For instance, for BC, in 70% of the exploitation slots the type
identified as the type relevant to action “predict benign” comes
from a 3-element subset of the set of 9 types in the data.
Similarly, for NI, the type identified as the type relevant to
action “predict attack” comes from a 2-element subset of the
set of 15 types in the data for 85% of the exploitation slots.

This information provided by RELEAF can be used to
identify the relevance relation that is present in a dataset.
For instance, consider the NI dataset. Since the type that is
assigned as the estimated relevant type most often is
only assigned in 45% of the exploitation slots, for the NI
dataset we should have $D_{rel} > 1$. However, since the pair
of types that are assigned as the estimated relevant type most
often is assigned in 85% of the exploitation slots, we
can conclude that approximately $D_{rel} \le 2$ for the NI dataset.
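The percentages quoted above are cumulative shares of the most frequently selected types. A minimal sketch of that arithmetic follows, using only the legible entries of Table VI; the function and variable names are hypothetical.

```python
def cumulative_share(rates_percent, k):
    """Sum of the k largest selection rates (in percent)."""
    return sum(sorted(rates_percent, reverse=True)[:k])


# Legible selection rates from Table VI (percent of exploitation slots).
bc_predict_benign = [27, 22, 21, 12]
ni_predict_attack = [45, 40, 5]
```

The three most selected types for BC account for 27 + 22 + 21 = 70 percent of the exploitation slots, and the two most selected types for NI account for 45 + 40 = 85 percent, while the single most selected NI type covers only 45 percent.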

VI. Conclusion 

In this paper we formalized the problem of learning the best 
action (prediction, recommendation etc.) to be taken based 


| Abbreviation | CTR |
|---|---|
| CZ | 377R |
| Hybrid-$\epsilon$ | 6.41 |
| LinUCB | 6.1)6 |
| RELEAF-ALL | 6.62 |

TABLE V
Comparison of the click through rates (CTRs) of RELEAF, CZ,
Hybrid-$\epsilon$ and LinUCB for webpage recommendation dataset.


Highest rates of relevance, given as type-rate:

| Dataset | Action | highest type-rate | 2nd highest type-rate | 3rd highest type-rate | 4th highest type-rate |
|---|---|---|---|---|---|
| BC | predict “benign” | 3-27% | 1-22% | 7-21% | 2-12% |
| NI | predict “attack” | 1-45% | 15-40% | THWc | 4-5% |
| WR | recommend webpage a | 3-46% | rwfc | 2-8% | ^Wc |
| WR | recommend webpage b | 2-57% | TFWc | JWc | |

TABLE VI
Average number of times RELEAF identified a type as the type
relevant to the specified action in exploitations.


on the current streaming Big Data by online learning the
relevance relation between types of contexts and actions. We
proposed an algorithm that (i) has sublinear regret with time
order independent of $D$, (ii) only requires reward observations
in explorations, (iii) for any $\epsilon > 0$, does not select any
$\epsilon$-suboptimal actions in exploitations with a high probability.
We illustrated the properties of the proposed algorithm via
extensive numerical simulations on real data, and showed that it
achieves high average reward and identifies the set of relevant
types. The proposed algorithm can be used in a variety of
applications (including applications requiring active learning)
such as medical diagnosis, recommender systems and stream
mining problems. An interesting future research direction is
learning both relevant types of contexts and relevant types of
actions for multi-armed bandit problems with high dimensional
action and context spaces.

Appendix A
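Proof of Theorem 1

Let $A := |\mathcal{A}|$. We first define a sequence of events which
will be used in the analysis of the regret of RELEAF. For
$p \in \mathcal{P}_{\mathcal{R}(a),t}$, let $\pi(a, p) = \mu(a, x^*_{\mathcal{R}(a)}(p))$, where $x^*_{\mathcal{R}(a)}(p)$ is
the context at the geometric center of $p$. For $j \in \mathcal{D}_{-\mathcal{R}(a)}$, let

$$\mathrm{INACC}_t(a, j) := \left\{ \left| \bar{r}^{(\mathcal{R}(a),j)}_t((p_{\mathcal{R}(a),t}, p_{j,t}), a) - \pi(a, p_{\mathcal{R}(a),t}) \right| > \frac{3L}{2} s(p_{\mathcal{R}(a),t}) \right\},$$

be the event that the pairwise sample mean corresponding
to pair $(\mathcal{R}(a), j)$ of types is inaccurate for action $a$. Let
$\mathrm{ACC}_t(a) := \bigcap_{j \in \mathcal{D}_{-\mathcal{R}(a)}} \mathrm{INACC}_t(a, j)^c$ be the event that all
pairwise sample means corresponding to pairs $(\mathcal{R}(a), j)$,
$j \in \mathcal{D}_{-\mathcal{R}(a)}$, are accurate. Consider $t \in \tau(T)$. Let $\mathrm{WNG}_t(a) :=
\{\mathcal{R}(a) \notin \mathrm{Rel}_t(a)\}$ be the event that the type relevant to
action $a$ is not in the set of candidate relevant types, and
$\mathrm{WNG}_t := \bigcup_{a \in \mathcal{A}} \mathrm{WNG}_t(a)$ be the event that the type relevant
to some action $a$ is not in the set of candidate relevant types
of that action. Finally, let $\mathrm{CORR}_T := \bigcap_{t \in \tau(T)} \mathrm{WNG}_t^c$ be the
event that the relevant types for all actions are in the set of
candidate relevant types at all exploitation steps.

We first prove several lemmas related to Theorem 1. The
next lemma gives a lower bound on the probability of $\mathrm{CORR}_T$.

Lemma 2. For RELEAF, for all $a \in \mathcal{A}$, $t \in \tau(T)$, we
have $P(\mathrm{INACC}_t(a, j)) \le \frac{2\delta}{ADt^4}$ for all $j \in \mathcal{D}_{-\mathcal{R}(a)}$, and
$P(\mathrm{CORR}_T) \ge 1 - \delta$ for any $T$.

Proof: For $t \in \tau(T)$, we have $\mathcal{U}_t = \emptyset$, hence

$$S^{\mathrm{ind}(q)}_t(q, a) \ge \frac{2 \log(tAD/\delta)}{(L s(p_{\mathcal{R}(a),t}))^2},$$

for all $a \in \mathcal{A}$, $q \in Q_{i,t}$ and $i \in \mathcal{D}$. Due to the Similarity
Assumption, since rewards in $\bar{r}^{(\mathcal{R}(a),j)}_t((p_{\mathcal{R}(a),t}, p_{j,t}), a)$ are
sampled from distributions with mean in
$[\pi(a, p_{\mathcal{R}(a),t}) - \frac{L}{2} s(p_{\mathcal{R}(a),t}),\ \pi(a, p_{\mathcal{R}(a),t}) + \frac{L}{2} s(p_{\mathcal{R}(a),t})]$,
using a Chernoff bound we get

$$P(\mathrm{INACC}_t(a, j)) \le 2 \exp\left( -2 (L s(p_{\mathcal{R}(a),t}))^2 \frac{2 \log(tAD/\delta)}{(L s(p_{\mathcal{R}(a),t}))^2} \right) \le \frac{2\delta}{ADt^4}.$$

We have $\mathrm{WNG}_t(a) \subset \bigcup_{j \in \mathcal{D}_{-\mathcal{R}(a)}} \mathrm{INACC}_t(a, j)$. Thus
$P(\mathrm{WNG}_t(a)) \le 2\delta/(At^4)$, and $P(\mathrm{WNG}_t) \le 2\delta/t^4$.
This implies that

$$P(\mathrm{CORR}_T^c) \le \sum_{t \in \tau(T)} P(\mathrm{WNG}_t) \le \sum_{t \in \tau(T)} \frac{2\delta}{t^4} \le \delta.$$

Lemma 3. When $\mathrm{CORR}_T$ happens we have for all $t \in \tau(T)$

$$|\bar{r}^{\hat{c}_t(a)}_t(p_{\hat{c}_t(a),t}, a) - \mu(a, x_{\mathcal{R}(a),t})| \le 8 L s(p_{\mathcal{R}(a),t}).$$

Proof: From Lemma 2, $\mathrm{CORR}_T$ happens when

$$|\bar{r}^{(\mathcal{R}(a),j)}_t((p_{\mathcal{R}(a),t}, p_{j,t}), a) - \pi(a, p_{\mathcal{R}(a),t})| \le \frac{3L}{2} s(p_{\mathcal{R}(a),t}),$$

for all $a \in \mathcal{A}$, $j \in \mathcal{D}_{-\mathcal{R}(a)}$, $t \in \tau(T)$. Since
$|\mu(a, x_{\mathcal{R}(a),t}) - \pi(a, p_{\mathcal{R}(a),t})| \le L s(p_{\mathcal{R}(a),t})/2$, we have

$$|\bar{r}^{(\mathcal{R}(a),j)}_t((p_{\mathcal{R}(a),t}, p_{j,t}), a) - \mu(a, x_{\mathcal{R}(a),t})| \le 2 L s(p_{\mathcal{R}(a),t}), \quad (A.1)$$

for all $a \in \mathcal{A}$, $j \in \mathcal{D}_{-\mathcal{R}(a)}$, $t \in \tau(T)$. Consider $\hat{c}_t(a)$. Since it
is chosen from $\mathrm{Rel}_t(a)$ as the type with the minimum variation,
we have on the event $\mathrm{CORR}_T$

$$|\bar{r}^{(\hat{c}_t(a),k)}_t((p_{\hat{c}_t(a),t}, p_{k,t}), a) - \bar{r}^{(\hat{c}_t(a),j)}_t((p_{\hat{c}_t(a),t}, p_{j,t}), a)| \le 3 L s(p_{\mathcal{R}(a),t}),$$

for all $j, k \in \mathcal{D}_{-\hat{c}_t(a)}$. Hence we have

$$|\bar{r}^{\mathcal{R}(a)}_t(p_{\mathcal{R}(a),t}, a) - \bar{r}^{\hat{c}_t(a)}_t(p_{\hat{c}_t(a),t}, a)|$$
$$\le \max_{k, j} \left\{ |\bar{r}^{(\mathcal{R}(a),k)}_t((p_{\mathcal{R}(a),t}, p_{k,t}), a) - \bar{r}^{(\hat{c}_t(a),j)}_t((p_{\hat{c}_t(a),t}, p_{j,t}), a)| \right\}$$
$$\le \max_{k, j} \left\{ |\bar{r}^{(\mathcal{R}(a),k)}_t((p_{\mathcal{R}(a),t}, p_{k,t}), a) - \bar{r}^{(\hat{c}_t(a),\mathcal{R}(a))}_t((p_{\hat{c}_t(a),t}, p_{\mathcal{R}(a),t}), a)| + |\bar{r}^{(\hat{c}_t(a),\mathcal{R}(a))}_t((p_{\hat{c}_t(a),t}, p_{\mathcal{R}(a),t}), a) - \bar{r}^{(\hat{c}_t(a),j)}_t((p_{\hat{c}_t(a),t}, p_{j,t}), a)| \right\}$$
$$\le 6 L s(p_{\mathcal{R}(a),t}). \quad (A.2)$$

Combining (A.1) and (A.2), we get

$$|\bar{r}^{\hat{c}_t(a)}_t(p_{\hat{c}_t(a),t}, a) - \mu(a, x_{\mathcal{R}(a),t})| \le 8 L s(p_{\mathcal{R}(a),t}).$$

Since for $t \in \tau(T)$, $a_t = \arg\max_{a \in \mathcal{A}} \bar{r}^{\hat{c}_t(a)}_t(p_{\hat{c}_t(a),t}, a)$,
using the result of Lemma 3 we conclude that

$$\mu_t(a_t) \ge \mu_t(a^*(x_t)) - 8L \left( s(p_{\mathcal{R}(a_t),t}) + s(p_{\mathcal{R}(a^*(x_t)),t}) \right).$$

Thus, the regret in exploitation steps is upper bounded by

$$8L \sum_{t \in \tau(T)} \left( s(p_{\mathcal{R}(a_t),t}) + s(p_{\mathcal{R}(a^*(x_t)),t}) \right) \le 16 L \sum_{t \in \tau(T)} \max_{a \in \mathcal{A}} s(p_{\mathcal{R}(a),t}) \le 16 L \sum_{t \in \tau(T)} \sum_{i \in \mathcal{D}} s(p_{i,t}) \le 16 L D \max_{i \in \mathcal{D}} \left( \sum_{t \in \tau(T)} s(p_{i,t}) \right).$$

We know that as time goes on RELEAF uses partitions with
smaller and smaller intervals, which reduces the regret in
exploitations. In order to bound the regret in exploitations
for any sequence of context arrivals, we assume a worst case
scenario, where context vectors arrive such that at each $t$,
the active interval that contains the context of each type has
the maximum possible length. This happens when for each
type $i$ contexts arrive in a way that all level $l$ intervals are
split to level $l + 1$ intervals, before any arrivals to these level
$l + 1$ intervals happen, for all $l = 0, 1, 2, \ldots$. This way it
is guaranteed that the length of the interval that contains the
context for each $t \in \tau(T)$ is maximized. Let $l_{max}$ be the level
of the maximum level interval in $\mathcal{P}_{i,T}$. For the worst case
context arrivals we must have

$$\sum_{l=0}^{l_{max}-1} 2^l 2^{\rho l} < T \ \Rightarrow \ l_{max} < 1 + \log_2 T/(1 + \rho),$$

since otherwise the maximum level interval will have level
larger than $l_{max}$. Hence we have

$$16 L D \max_{i \in \mathcal{D}} \left( \sum_{t \in \tau(T)} s(p_{i,t}) \right) \le 16 L D \sum_{l=0}^{1 + \log_2 T/(1+\rho)} 2^l 2^{\rho l} 2^{-l} = 16 L D \sum_{l=0}^{1 + \log_2 T/(1+\rho)} 2^{\rho l} \le 16 L D 2^{2\rho} T^{\rho/(1+\rho)}.$$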
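The worst-case level bound used in this proof can be checked numerically. The sketch below finds the largest $l_{max}$ that satisfies the arrival-count constraint $\sum_{l=0}^{l_{max}-1} 2^l 2^{\rho l} < T$ and compares it with $1 + \log_2 T/(1+\rho)$; the function names are hypothetical.

```python
import math


def max_level_worst_case(T, rho):
    """Largest l_max with sum_{l=0}^{l_max-1} 2^l * 2^{rho*l} < T."""
    l_max, used = 0, 0.0
    # Keep adding levels while the arrivals they consume still fit below T.
    while used + 2.0 ** ((1.0 + rho) * l_max) < T:
        used += 2.0 ** ((1.0 + rho) * l_max)
        l_max += 1
    return l_max


def level_bound(T, rho):
    """Upper bound 1 + log2(T) / (1 + rho) on the maximum level."""
    return 1.0 + math.log2(T) / (1.0 + rho)
```

The bound holds because the last term of the sum, $2^{(1+\rho)(l_{max}-1)}$, is itself smaller than $T$.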


Appendix B 
Proof of Theorem 2 


Recall that time t is an exploitation step only if U t = 0. 
In order for this to happen we need S^ g '(q,a) > D l t for 
all q £ Qi(t). There are D(D — 1) type pairs. Whenever 
action a is explored, all the counters for these D(D — 1) type 
pairs are updated for the pairs of intervals that contain types 
of contexts present at time t, i.e. q £ Q t . Now consider a 
hypothetical scenario in which instead of updating the counters 
of all q £ Q t , the counter of only one of the randomly selected 
interval pair is updated. Clearly, the exploration regret of this 
hypothetical scenario upper bounds the exploration regret of 
the original scenario. In this scenario for any p, £ V,,t, pj £ 
Vj.t, we have 




2 log (tAD/5) 

L 2 min (s(pi), s(pj)) : 


+ 1 . 


We can go one step further and consider a second hypo¬ 
thetical scenario where there is only two types i and j, for 
which the actual regret at every exploration step is magnified 
(multiplied) by D(D— 1). The maximum possible exploration 
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regret of the second scenario (for the worst case of type i 
and j context arrivals) upper bounds the exploration regret 
of the first scenario. Hence, we bound the regret of the 
second scenario. Let Z max be the maximum possible level 
for an active interval for type i by time T. We must have 
Si=o x 1 < T, which implies that Z max < 1 + log 2 T/p. 

Next, we consider all pairs of intervals for which the minimum interval has level $l$. For each type $j$ interval $p_j$ that has level $l$, there exist no more than $\sum_{k=l}^{1+\log_2 T/p} 2^k$ type $i$ intervals that have levels greater than or equal to $l$. Consider a level $k$ type $i$ interval $p_i$ such that $l \le k \le 1 + \log_2 T/p$. Then for the pair of intervals $(p_i, p_j)$ the exploration regret is bounded by $(c_O + 1)\left(2\log(tAD/\delta)/(2^{-2k}L^2) + 1\right)$. Hence, the worst-case exploration regret is bounded by

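$$
\begin{aligned}
&(c_O+1) D^2 \left( 2 \sum_{l=0}^{1+\log_2 T/p} 2^{l} \sum_{k=l}^{1+\log_2 T/p} 2^{k} \left( \frac{2\log(tAD/\delta)}{2^{-2k}L^2} + 1 \right) \right) \\
&\quad = (c_O+1) D^2 \left( \frac{4\log(tAD/\delta)}{L^2} \sum_{l=0}^{1+\log_2 T/p} 2^{l} \sum_{k=l}^{1+\log_2 T/p} 2^{3k} + 2 \sum_{l=0}^{1+\log_2 T/p} 2^{l} \sum_{k=l}^{1+\log_2 T/p} 2^{k} \right) \\
&\quad \le \frac{4D^2(c_O+1)\log(tAD/\delta)}{L^2} \times \frac{240}{7}\, T^{4/p} + \frac{64 D^2 (c_O+1)}{3}\, T^{2/p} .
\end{aligned}
$$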
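The constants $240/7$ and $64/3$ come from bounding the two double sums. A minimal numerical sketch of that step follows; the helper names are hypothetical and the upper limit is assumed to be the integer part of $1 + \log_2 T/p$. Note that $64/3$ includes the leading factor 2, so the bare second double sum is compared with $32/3$.

```python
import math

def double_sums(T: int, p: float):
    """Return (sum_l 2^l sum_{k>=l} 2^(3k), sum_l 2^l sum_{k>=l} 2^k) up to level n."""
    n = math.floor(1 + math.log2(T) / p)  # assumed integer upper limit
    cubic = sum(2**l * sum(2**(3 * k) for k in range(l, n + 1)) for l in range(n + 1))
    linear = sum(2**l * sum(2**k for k in range(l, n + 1)) for l in range(n + 1))
    return cubic, linear

def claimed_bounds(T: int, p: float):
    """Closed-form bounds (240/7) T^(4/p) and (32/3) T^(2/p)."""
    return 240 / 7 * T ** (4 / p), 32 / 3 * T ** (2 / p)
```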


Appendix C 
Proof of Theorem 4 

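To achieve $\epsilon$-optimality in every exploitation step it is sufficient to have

$$
\begin{aligned}
\mathrm{INACC}_t(a,j)^c
&= \left\{ \left| \bar{r}_t^{(\mathcal{R}(a),j)}\big((p_{\mathcal{R}(a),t}, p_{j,t}), a\big) - \pi(a, p_{\mathcal{R}(a),t}) \right| \le \tfrac{3}{2} L\, s(p_{\mathcal{R}(a),t}) \right\} \\
&\subset \left\{ \left| \bar{r}_t^{(\mathcal{R}(a),j)}\big((p_{\mathcal{R}(a),t}, p_{j,t}), a\big) - \pi(a, p_{\mathcal{R}(a),t}) \right| \le \epsilon \right\},
\end{aligned}
$$

for $t \in \tau(T)$. This is satisfied when $l_{\min} \ge \log_2(3L/(2\epsilon))$.
Starting with level $l_{\min}$ intervals instead of level 0 intervals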
decreases the exploitation regret of ORL-CF. Hence the regret 
bound in Theorem [7] is an upper bound on the exploitation 
regret. 
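The condition on $l_{\min}$ can be turned into a one-line computation. The sketch below is an illustration under the assumption that the estimate slack at level $l$ is $\tfrac{3}{2} L\, 2^{-l}$, which is what the threshold $\log_2(3L/(2\epsilon))$ implies; the function names are hypothetical.

```python
import math

def min_level(L: float, eps: float) -> int:
    """Smallest non-negative integer level l with l >= log2(3L / (2*eps))."""
    return max(0, math.ceil(math.log2(3 * L / (2 * eps))))

def estimate_slack(L: float, level: int) -> float:
    """Assumed accuracy of a level-`level` estimate: (3/2) * L * 2^(-level)."""
    return 1.5 * L * 2.0 ** (-level)
```

With `L = 1.0` and `eps = 0.1`, the slack at `min_level(L, eps)` falls below `eps`, while one level earlier it does not.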

For any sequence of context arrivals, we have the following bound on the level of the interval with the maximum level:

$$
l_{\max} \le 1 + l_{\min} + \log_2 T/p .
$$

Continuing similarly to the proof of Theorem 2, we have

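$$
\begin{aligned}
&(c_O+1) D^2 \left( 2 \sum_{l=0}^{1+\log_2 T/p} 2^{l_{\min}} 2^{l} \sum_{k=l}^{1+\log_2 T/p} 2^{l_{\min}} 2^{k} \left( \frac{2\log(tAD/\delta)}{2^{-2 l_{\min}}\, 2^{-2k} L^2} + 1 \right) \right) \\
&\quad = (c_O+1) D^2 \left( 2^{4 l_{\min}} \frac{4\log(tAD/\delta)}{L^2} \sum_{l=0}^{1+\log_2 T/p} 2^{l} \sum_{k=l}^{1+\log_2 T/p} 2^{3k} + 2 \cdot 2^{2 l_{\min}} \sum_{l=0}^{1+\log_2 T/p} 2^{l} \sum_{k=l}^{1+\log_2 T/p} 2^{k} \right) \\
&\quad \le 2^{4 l_{\min}} \frac{4D^2(c_O+1)\log(tAD/\delta)}{L^2} \times \frac{240}{7}\, T^{4/p} + 2^{2 l_{\min}} \frac{64 D^2 (c_O+1)}{3}\, T^{2/p} .
\end{aligned}
$$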


Appendix D 
Proof of Theorem 5 


A. Preliminaries 


Let $A := |\mathcal{A}|$. We first define a sequence of events which will be used in the analysis of the regret of RELEAF. For $p \in \mathcal{P}_{\mathcal{R}(a),t}$, let $\pi(a,p) = \mu(a, x^*_{\mathcal{R}(a)}(p))$, where $x^*_{\mathcal{R}(a)}(p) = \{x^*(p_i)\}_{i \in \mathcal{R}(a)}$ such that $x^*(p_i)$ is the type $i$ context at the geometric center of $p_i$. Let $\mathcal{W}(\mathcal{R}(a))$ be the set of $D_{\mathrm{rel}}$-tuples of types such that $\mathcal{R}(a) \subset w$ for every $w \in \mathcal{W}(\mathcal{R}(a))$. We have


$$
|\mathcal{W}(\mathcal{R}(a))| = \binom{D - |\mathcal{R}(a)|}{D_{\mathrm{rel}} - |\mathcal{R}(a)|} .
$$
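To make the counting concrete, the following sketch enumerates every tuple of a given size that contains a fixed set of relevant types and compares the count with the corresponding binomial coefficient. It is illustrative only: `candidate_tuples` is a hypothetical helper, types are labelled `0..D-1`, and tuples are treated as unordered sets of types, which is an assumption about the intended definition.

```python
from itertools import combinations
from math import comb

def candidate_tuples(D: int, d_rel: int, relevant: set[int]) -> list[tuple[int, ...]]:
    """All size-d_rel subsets of {0, ..., D-1} that contain `relevant`.

    Requires len(relevant) <= d_rel <= D.
    """
    fixed = frozenset(relevant)
    others = [i for i in range(D) if i not in fixed]
    # Complete the relevant types with every choice of the remaining slots.
    return [
        tuple(sorted(fixed | set(extra)))
        for extra in combinations(others, d_rel - len(fixed))
    ]
```

For example, with `D = 5`, `d_rel = 3` and a single relevant type, the list has `comb(4, 2)` entries.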


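For a $D_{\mathrm{rel}}$-tuple of types $w$, let $\mathcal{D}(w, D_{\mathrm{rel}})$ be the set of $D_{\mathrm{rel}}$-tuples of types whose elements are drawn from the types not in $w$. For any $w \in \mathcal{W}(\mathcal{R}(a))$ and $j \in \mathcal{D}(w, D_{\mathrm{rel}})$, let

$$
\mathrm{INACC}_t(a,w,j) := \left\{ \left| \bar{r}_t^{(w,j)}\big((p_{w,t}, p_{j,t}), a\big) - \pi(a, p_{\mathcal{R}(a),t}) \right| > \tfrac{3}{2} L \sqrt{D_{\mathrm{rel}}} \max_{i \in \mathcal{R}(a)} s(p_{i,t}) \right\}
$$

be the event that the sample mean reward of action $a$ corresponding to the $2D_{\mathrm{rel}}$-tuple of types $(w,j)$ is inaccurate. Let

$$
\mathrm{ACC}_t(a) := \bigcap_{w \in \mathcal{W}(\mathcal{R}(a))} \; \bigcap_{j \in \mathcal{D}(w, D_{\mathrm{rel}})} \mathrm{INACC}_t(a,w,j)^c
$$

be the event that the sample mean reward estimates of action $a$ corresponding to all tuples $(w,j)$, $w \in \mathcal{W}(\mathcal{R}(a))$ and $j \in \mathcal{D}(w, D_{\mathrm{rel}})$, are accurate. Consider $t \in \tau(T)$. Let

$$
\mathrm{WNG}_t(a) := \bigcup_{w \in \mathcal{W}(\mathcal{R}(a))} \{ w \notin \mathrm{Rel}_t(a) \}
$$

be the event that some $D_{\mathrm{rel}}$-tuple that contains $\mathcal{R}(a)$ is not in the set of relevant tuples of types for action $a$. Let $\mathrm{WNG}_t := \bigcup_{a \in \mathcal{A}} \mathrm{WNG}_t(a)$, and let $\mathrm{CORR}_T := \bigcap_{t \in \tau(T)} \mathrm{WNG}_t^c$ be the event that every $D_{\mathrm{rel}}$-tuple of types that contains the set of relevant contexts of each action is an element of the set of candidate relevant $D_{\mathrm{rel}}$-tuples of types corresponding to that action at all exploitation steps.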

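We first prove several lemmas related to Theorem 5. The next lemma gives a lower bound on the probability of $\mathrm{CORR}_T$.

Lemma 4. For RELEAF, for all $a \in \mathcal{A}$, $t \in \tau(T)$, we have $P(\mathrm{INACC}_t(a,w,j)) \le \frac{2\delta}{A D^* t^4}$ for all $w \in \mathcal{W}(\mathcal{R}(a))$, $j \in \mathcal{D}(w, D_{\mathrm{rel}})$, and $P(\mathrm{CORR}_T) \ge 1 - \delta$ for any $T$.

Proof: For $t \in \tau(T)$, we have $U_t = \emptyset$, hence

$$
S_t^{\mathrm{ind}(q)}(q,a) \ge \frac{2\log(tAD^*/\delta)}{\big(L \min_{i \in \mathrm{ind}(q)} s(p_{i,t})\big)^2},
$$

for all $a \in \mathcal{A}$, $q \in Q(t)$. Due to the Similarity Assumption, since for all $a \in \mathcal{A}$, $w \in \mathcal{W}(\mathcal{R}(a))$ and $j \in \mathcal{D}(w, D_{\mathrm{rel}})$ the rewards in $\bar{r}_t^{(w,j)}((p_{w,t}, p_{j,t}), a)$ are sampled from distributions with mean in the interval $\big[\pi(a, p_{\mathcal{R}(a),t}) - \tfrac{L}{2}\sqrt{D_{\mathrm{rel}}} \max_{i \in \mathcal{R}(a)} s(p_{i,t}),\; \pi(a, p_{\mathcal{R}(a),t}) + \tfrac{L}{2}\sqrt{D_{\mathrm{rel}}} \max_{i \in \mathcal{R}(a)} s(p_{i,t})\big]$, using a Chernoff bound we get

$$
\begin{aligned}
P(\mathrm{INACC}_t(a,w,j))
&\le 2 \exp\left( -2 \Big( L \sqrt{D_{\mathrm{rel}}} \max_{i \in \mathcal{R}(a)} s(p_{i,t}) \Big)^2 \frac{2\log(tAD^*/\delta)}{\big(L \min_{i \in (w,j)} s(p_{i,t})\big)^2} \right) \\
&\le \frac{2\delta}{A D^* t^4} .
\end{aligned}
$$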
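A quick numerical sanity check of this concentration step is sketched below. It assumes rewards bounded in $[0,1]$ (so the Chernoff-Hoeffding form $2\exp(-2\varepsilon^2 n)$ applies) and $\delta \le 1$; the function names and the sample parameter values are hypothetical.

```python
import math

def required_samples(t: int, A: int, D_star: int, delta: float, L: float, s_min: float) -> float:
    """Counter threshold 2*log(t*A*D_star/delta) / (L*s_min)^2 reached before exploiting."""
    return 2 * math.log(t * A * D_star / delta) / (L * s_min) ** 2

def hoeffding_tail(n: float, L: float, D_rel: int, s_max: float) -> float:
    """Two-sided tail 2*exp(-2*eps^2*n) for deviation eps = L*sqrt(D_rel)*s_max."""
    eps = L * math.sqrt(D_rel) * s_max
    return 2 * math.exp(-2 * eps**2 * n)

def target_probability(t: int, A: int, D_star: int, delta: float) -> float:
    """The bound 2*delta / (A * D_star * t^4) stated in Lemma 4."""
    return 2 * delta / (A * D_star * t**4)
```

Because the smallest interval length appears in the sample threshold and the largest in the deviation, the tail can only shrink when the two lengths differ.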

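We have

$$
\mathrm{WNG}_t(a) \subset \bigcup_{w \in \mathcal{W}(\mathcal{R}(a))} \; \bigcup_{j \in \mathcal{D}(w, D_{\mathrm{rel}})} \mathrm{INACC}_t(a,w,j) .
$$

Since the number of $2D_{\mathrm{rel}}$-tuples that contain $\mathcal{R}(a)$ is $\binom{D - |\mathcal{R}(a)|}{2D_{\mathrm{rel}} - |\mathcal{R}(a)|}$, which is less than or equal to $D^* = \binom{D-1}{2D_{\mathrm{rel}}-1}$ since $1 \le |\mathcal{R}(a)| \le D_{\mathrm{rel}}$, we have

$$
P(\mathrm{WNG}_t(a)) \le \frac{2\delta}{A t^4} \quad \text{and} \quad P(\mathrm{WNG}_t) \le \frac{2\delta}{t^4} .
$$

This implies that

$$
P(\mathrm{CORR}_T^c) \le \sum_{t \in \tau(T)} P(\mathrm{WNG}_t) \le \sum_{t \in \tau(T)} \frac{2\delta}{t^4} \le 2\delta \sum_{t=3}^{\infty} \frac{1}{t^4} \le \delta .
$$

Lemma 5. When $\mathrm{CORR}_T$ happens we have for all $t \in \tau(T)$

$$
\begin{aligned}
\left| \bar{r}_t^{c_t(a)}(p_{c_t(a),t}, a) - \mu(a, x_{\mathcal{R}(a),t}) \right|
&\le 3L\sqrt{D_{\mathrm{rel}}} \Big( \max_{i \in c_t(a)} s(p_{i,t}) + \max_{i \in w} s(p_{i,t}) \Big) \\
&\quad + 2L\sqrt{D_{\mathrm{rel}}} \max_{i \in \mathcal{R}(a)} s(p_{i,t}) .
\end{aligned}
$$

Proof: From Lemma 4, $\mathrm{CORR}_T$ happens when

$$
\left| \bar{r}_t^{w}(p_{w,t}, a) - \pi(a, p_{\mathcal{R}(a),t}) \right| \le \tfrac{3}{2} L \sqrt{D_{\mathrm{rel}}} \max_{i \in \mathcal{R}(a)} s(p_{i,t}),
$$

for all $a \in \mathcal{A}$, $w \in \mathcal{W}(\mathcal{R}(a))$, $t \in \tau(T)$. Since

$$
\left| \mu(a, x_{\mathcal{R}(a),t}) - \pi(a, p_{\mathcal{R}(a),t}) \right| \le \tfrac{L}{2} \sqrt{D_{\mathrm{rel}}} \max_{i \in \mathcal{R}(a)} s(p_{i,t})
$$

by the Similarity Assumption, we have

$$
\left| \bar{r}_t^{w}(p_{w,t}, a) - \mu(a, x_{\mathcal{R}(a),t}) \right| \le 2L\sqrt{D_{\mathrm{rel}}} \max_{i \in \mathcal{R}(a)} s(p_{i,t}), \tag{A.1}
$$

for all $a \in \mathcal{A}$, $w \in \mathcal{W}(\mathcal{R}(a))$, $t \in \tau(T)$. Consider $c_t(a)$. Since it is chosen from $\mathrm{Rel}_t(a)$ as the $D_{\mathrm{rel}}$-tuple of types with the minimum variation, we have on the event $\mathrm{CORR}_T$

$$
\left| \bar{r}_t^{(c_t(a),k)}\big((p_{c_t(a),t}, p_{k,t}), a\big) - \bar{r}_t^{(c_t(a),j)}\big((p_{c_t(a),t}, p_{j,t}), a\big) \right| \le 3L\sqrt{D_{\mathrm{rel}}} \max_{i \in c_t(a)} s(p_{i,t}),
$$

for all $j,k \in \mathcal{D}(c_t(a), D_{\mathrm{rel}})$. For any $w \in \mathcal{W}(\mathcal{R}(a))$, let $g(w, c_t(a))$ be a $2D_{\mathrm{rel}}$-tuple such that for all $i \in w$ and $j \in c_t(a)$, $i,j \in g(w, c_t(a))$. The existence of at least one such $2D_{\mathrm{rel}}$-tuple of types is guaranteed since $w$ and $c_t(a)$ are both $D_{\mathrm{rel}}$-tuples of types. Hence, we have for any $w \in \mathcal{W}(\mathcal{R}(a))$

$$
\begin{aligned}
&\left| \bar{r}_t^{w}(p_{w,t}, a) - \bar{r}_t^{c_t(a)}(p_{c_t(a),t}, a) \right| \\
&\quad \le \max_{k \in \mathcal{D}(w, D_{\mathrm{rel}}),\, j \in \mathcal{D}(c_t(a), D_{\mathrm{rel}})} \left\{ \left| \bar{r}_t^{(w,k)}\big((p_{w,t}, p_{k,t}), a\big) - \bar{r}_t^{(c_t(a),j)}\big((p_{c_t(a),t}, p_{j,t}), a\big) \right| \right\} \\
&\quad \le \max_{k \in \mathcal{D}(w, D_{\mathrm{rel}}),\, j \in \mathcal{D}(c_t(a), D_{\mathrm{rel}})} \Big\{ \left| \bar{r}_t^{(w,k)}\big((p_{w,t}, p_{k,t}), a\big) - \bar{r}_t^{g(w,c_t(a))}\big(p_{g(w,c_t(a)),t}, a\big) \right| \\
&\qquad\qquad + \left| \bar{r}_t^{g(w,c_t(a))}\big(p_{g(w,c_t(a)),t}, a\big) - \bar{r}_t^{(c_t(a),j)}\big((p_{c_t(a),t}, p_{j,t}), a\big) \right| \Big\} \\
&\quad \le 3L\sqrt{D_{\mathrm{rel}}} \Big( \max_{i \in c_t(a)} s(p_{i,t}) + \max_{i \in w} s(p_{i,t}) \Big) .
\end{aligned}
\tag{A.2}
$$

Combining (A.1) and (A.2), we get

$$
\begin{aligned}
\left| \bar{r}_t^{c_t(a)}(p_{c_t(a),t}, a) - \mu(a, x_{\mathcal{R}(a),t}) \right|
&\le 3L\sqrt{D_{\mathrm{rel}}} \Big( \max_{i \in c_t(a)} s(p_{i,t}) + \max_{i \in w} s(p_{i,t}) \Big) \\
&\quad + 2L\sqrt{D_{\mathrm{rel}}} \max_{i \in \mathcal{R}(a)} s(p_{i,t}) .
\end{aligned}
$$


B. Regret bound for exploitations


Since for $t \in \tau(T)$, $\alpha_t = \arg\max_{a \in \mathcal{A}} \bar{r}_t^{c_t(a)}(p_{c_t(a),t}, a)$, using the result of Lemma 5 we conclude that

$$
\begin{aligned}
\mu_t(\alpha_t) \ge \mu_t(a^*(x_t)) &- 6L\sqrt{D_{\mathrm{rel}}} \Big( \max_{i \in c_t(a)} s(p_{i,t}) + \max_{i \in \mathcal{D}} s(p_{i,t}) \Big) \\
&- 4L\sqrt{D_{\mathrm{rel}}} \max_{i \in \mathcal{R}(a)} s(p_{i,t}) .
\end{aligned}
$$

Thus, the regret in exploitation steps is bounded above by

$$
\begin{aligned}
&6L\sqrt{D_{\mathrm{rel}}} \sum_{t \in \tau(T)} \Big( \max_{i \in c_t(a)} s(p_{i,t}) + \max_{i \in \mathcal{D}} s(p_{i,t}) \Big) + 4L\sqrt{D_{\mathrm{rel}}} \sum_{t \in \tau(T)} \max_{i \in \mathcal{R}(a)} s(p_{i,t}) \\
&\quad \le 16L\sqrt{D_{\mathrm{rel}}} \sum_{t \in \tau(T)} \max_{i \in \mathcal{D}} s(p_{i,t}) \\
&\quad \le 16L\sqrt{D_{\mathrm{rel}}} \sum_{t \in \tau(T)} \sum_{i \in \mathcal{D}} s(p_{i,t}) \\
&\quad \le 16LD\sqrt{D_{\mathrm{rel}}} \max_{i \in \mathcal{D}} \Big( \sum_{t \in \tau(T)} s(p_{i,t}) \Big) .
\end{aligned}
$$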

We know that as time goes on RELEAF uses partitions with smaller and smaller intervals, which reduces the regret in exploitations. In order to bound the regret in exploitations for any sequence of context arrivals, we assume a worst-case scenario, where context vectors arrive such that at each $t$, the active interval that contains the context of each type has the maximum possible length. This happens when, for each type $i$, contexts arrive in a way that all level $l$ intervals are split into level $l+1$ intervals before any arrivals to these level $l+1$ intervals happen, for all $l = 0, 1, 2, \ldots$. This way it is guaranteed that the length of the interval that contains the context for each $t \in \tau(T)$ is maximized. Let $l_{\max}$ be the level of the maximum-level interval in $\mathcal{P}_i(T)$. For the worst-case context arrivals we must have

$$
\sum_{l=0}^{l_{\max}-1} 2^{l}\, 2^{pl} < T \;\Rightarrow\; l_{\max} < 1 + \log_2 T/(1+p),
$$

since otherwise the maximum-level hypercube would have a level larger than $l_{\max}$. Hence, we have

$$
\begin{aligned}
16LD\sqrt{D_{\mathrm{rel}}} \max_{i \in \mathcal{D}} \Big( \sum_{t \in \tau(T)} s(p_{i,t}) \Big)
&\le 16LD\sqrt{D_{\mathrm{rel}}} \sum_{l=0}^{1+\log_2 T/(1+p)} 2^{l}\, 2^{pl}\, 2^{-l} \\
&= 16LD\sqrt{D_{\mathrm{rel}}} \sum_{l=0}^{1+\log_2 T/(1+p)} 2^{pl} \\
&\le 16LD\sqrt{D_{\mathrm{rel}}}\, 2^{2p}\, T^{p/(1+p)} .
\end{aligned}
$$

Hence, we have $R_I(T) = O(T^{p/(1+p)})$ with probability $1 - \delta$.
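The worst-case arrival schedule described above can be simulated directly. The sketch below is a simplified illustration for a single type: it assumes an integer $p \ge 1$, that each of the $2^l$ level $l$ intervals receives exactly $2^{pl}$ arrivals before any level $l+1$ interval is used, and that a level $l$ interval has length $2^{-l}$. The function name is hypothetical.

```python
import math

def worst_case_exploitation(T: int, p: int) -> tuple[int, float]:
    """Feed T arrivals level by level; return (deepest level used, sum of interval lengths)."""
    remaining, level, total_length = T, 0, 0.0
    while remaining > 0:
        phase = 2 ** ((1 + p) * level)  # 2^l intervals times 2^(p*l) arrivals each
        used = min(phase, remaining)
        total_length += used * 2.0 ** (-level)  # each arrival sees a length-2^(-l) interval
        remaining -= used
        if remaining > 0:
            level += 1  # all level-l intervals are now split
    return level, total_length
```

The returned level can be compared with $1 + \log_2 T/(1+p)$ and the accumulated length with $2^{2p} T^{p/(1+p)}$.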

C. Regret bound for explorations 

Recall that time $t$ is an exploitation step only if $U_t = \emptyset$. For this to happen we need $S_t^{\mathrm{ind}(q)}(q,a) \ge D_{i,t}$ for all $q \in Q_i(t)$. The number of distinct $2D_{\mathrm{rel}}$-tuples of types is $\binom{D}{2D_{\mathrm{rel}}}$. Whenever action $a$ is explored, all the counters for these $\binom{D}{2D_{\mathrm{rel}}}$ type tuples are updated for the $2D_{\mathrm{rel}}$-tuples of intervals that contain the types of contexts present at time $t$, i.e., $q \in Q_t$. Now consider a hypothetical scenario in which, instead of updating the counters of all $q \in Q_t$, the counter of only one randomly selected $2D_{\mathrm{rel}}$-tuple of intervals is updated. Clearly, the exploration regret of this hypothetical scenario upper bounds the exploration regret of the original scenario. In this scenario, for any $q \in Q_t$, we have

$$
S_t^{\mathrm{ind}(q)}(q,a) \le \frac{2\log(tAD^*/\delta)}{\big(L \min_{i \in \mathrm{ind}(q)} s(p_i)\big)^2} + 1 .
$$

We fix a $2D_{\mathrm{rel}}$-tuple of types $j = (j_1, j_2, \ldots, j_{2D_{\mathrm{rel}}})$, and analyze the worst-case regret due to exploration of this tuple of types, which is denoted by $R_{O,j}(T)$. Since there are $\binom{D}{2D_{\mathrm{rel}}}$ such tuples of types, an upper bound on the exploration regret is $\binom{D}{2D_{\mathrm{rel}}} R_{O,j}(T)$.

Let $l_{\max}$ be the maximum possible level for an active interval for type $i$ by time $T$. We must have $\sum_{l=0}^{l_{\max}-1} 2^{pl} < T$, which implies that $l_{\max} < 1 + \log_2 T/p$. Let $\gamma = 1 + \log_2 T/p$.

First, we will consider the exploration regret incurred in all configurations where type $j_n$'s intervals have levels $l_n$, for $n = 1, 2, \ldots, 2D_{\mathrm{rel}}$, such that $l_1 \le l_2 \le \cdots \le l_{2D_{\mathrm{rel}}}$. We denote this ordering by $j^*$ and the exploration regret in this ordering by $R_{O,j^*}(T)$. There are $(2D_{\mathrm{rel}})!$ different configurations in which the orderings of levels of the intervals of the types are different.

Let $z = 2D_{\mathrm{rel}}$. Consider the tuple of intervals $(p_{j^*_1}, \ldots, p_{j^*_{2D_{\mathrm{rel}}}})$. The exploration regret for this tuple of intervals is bounded by

$$
(c_O + 1) \left( \frac{2\log(TAD^*/\delta)}{2^{-2l_z} L^2} + 1 \right) .
$$

Hence, we have

$$
\begin{aligned}
R_{O,j^*}(T) &\le (c_O+1) \sum_{l_1=0}^{\gamma} 2^{l_1} \sum_{l_2=l_1}^{\gamma} 2^{l_2} \cdots \sum_{l_z=l_{z-1}}^{\gamma} 2^{l_z} \left( \frac{2\log(TAD^*/\delta)}{2^{-2l_z}L^2} + 1 \right) \\
&\le (c_O+1) \sum_{l_1=0}^{\gamma} 2^{l_1} \cdots \sum_{l_{z-1}=l_{z-2}}^{\gamma} 2^{l_{z-1}}\, O(T^{3/p}\log T) \\
&\le (c_O+1) \sum_{l_1=0}^{\gamma} 2^{l_1} \cdots \sum_{l_{z-2}=l_{z-3}}^{\gamma} 2^{l_{z-2}}\, O(T^{4/p}\log T) \\
&= O\big(T^{(2+2D_{\mathrm{rel}})/p} \log T\big) .
\end{aligned}
$$

Since $R_O(T) \le \binom{D}{2D_{\mathrm{rel}}} (2D_{\mathrm{rel}})!\, R_{O,j^*}(T)$, we have

$$
R_O(T) = O\big(T^{(2+2D_{\mathrm{rel}})/p} \log T\big) .
$$
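The growth order of the nested sum can be checked with a short dynamic program. This is a sketch of the dominant term only (the factor $2^{l_1 + \cdots + l_{z-1}} 2^{3 l_z}$, ignoring the logarithm, the constant $2/L^2$ and the additive 1); `nested_sum` and `nested_sum_bound` are hypothetical names, and `n` stands for an integer upper limit in place of $\gamma$.

```python
from fractions import Fraction

def nested_sum(z: int, n: int) -> int:
    """Sum over 0 <= l_1 <= ... <= l_z <= n of 2^(l_1 + ... + l_{z-1}) * 2^(3*l_z)."""
    # tail[m] holds the value of the remaining inner sums when they start at index m.
    tail = [0] * (n + 2)
    for m in range(n, -1, -1):
        tail[m] = tail[m + 1] + 2 ** (3 * m)  # innermost sum over l_z
    for _ in range(z - 1):
        outer = [0] * (n + 2)
        for m in range(n, -1, -1):
            outer[m] = outer[m + 1] + 2**m * tail[m]  # wrap one more sum around
        tail = outer
    return tail[0]

def nested_sum_bound(z: int, n: int) -> Fraction:
    """Crude bound (8/7) * 2^(z-1) * 2^((z+2)*n) obtained by dropping the ordering."""
    return Fraction(8, 7) * 2 ** (z - 1) * 2 ** ((z + 2) * n)
```

Since $2^{\gamma} = 2\,T^{1/p}$, a bound of order $2^{(z+2)n}$ corresponds to $T^{(2+z)/p} = T^{(2+2D_{\mathrm{rel}})/p}$, matching the exponent above.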


D. Balancing the regret due to exploitations and explorations 
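From the results of the previous subsections we have, with probability $1 - \delta$, $R_I(T) = O(T^{p/(1+p)})$ and, up to a logarithmic factor in $T$, $R_O(T) = O(T^{(2+2D_{\mathrm{rel}})/p})$. Since $R_I(T)$ is increasing in $p$ and $R_O(T)$ is decreasing in $p$, there is a unique $p$ for which the two time orders are equal. This unique solution satisfies

$$
\frac{p}{1+p} = \frac{2 + 2D_{\mathrm{rel}} + \sqrt{4D_{\mathrm{rel}}^2 + 16 D_{\mathrm{rel}} + 12}}{4 + 2D_{\mathrm{rel}} + \sqrt{4D_{\mathrm{rel}}^2 + 16 D_{\mathrm{rel}} + 12}} .
$$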
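The balancing step amounts to solving $p/(1+p) = (2+2D_{\mathrm{rel}})/p$, i.e. the quadratic $p^2 - (2+2D_{\mathrm{rel}})p - (2+2D_{\mathrm{rel}}) = 0$, and taking its positive root. The sketch below works this out numerically; the function names are hypothetical and the closed form for $p$ is derived here from the quadratic rather than quoted from the text.

```python
import math

def balancing_p(d_rel: int) -> float:
    """Positive root of p^2 - b*p - b = 0 with b = 2 + 2*d_rel."""
    b = 2 + 2 * d_rel
    return (b + math.sqrt(b * b + 4 * b)) / 2

def regret_exponent(d_rel: int) -> float:
    """Common time exponent p/(1+p), written in closed form."""
    root = math.sqrt(4 * d_rel**2 + 16 * d_rel + 12)
    return (2 + 2 * d_rel + root) / (4 + 2 * d_rel + root)
```

For any `d_rel`, the value `balancing_p(d_rel)` makes the exploitation exponent `p / (1 + p)` and the exploration exponent `(2 + 2 * d_rel) / p` coincide with `regret_exponent(d_rel)`.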







